Repository: ContinuumIO/gtc2017-numba
Branch: master
Commit: 6ddaeec9baec
Files: 15
Total size: 247.5 KB
Directory structure:
gitextract_ini3svql/
├── 1 - Numba Basics.ipynb
├── 2 - CUDA Basics.ipynb
├── 3 - Memory Management.ipynb
├── 4 - Writing CUDA Kernels.ipynb
├── 5 - Troubleshooting and Debugging.ipynb
├── 6 - Extra Topics.ipynb
├── README.md
├── debug/
│ ├── ex1.py
│ ├── ex1a.py
│ ├── ex2.py
│ ├── ex3.py
│ └── ex3a.py
└── docker/
├── README.md
├── base/
│ └── Dockerfile
└── notebooks/
└── Dockerfile
================================================
FILE CONTENTS
================================================
================================================
FILE: 1 - Numba Basics.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GTC 2017 Numba Tutorial Notebook 1: Numba Basics\n",
"\n",
"## What is Numba?\n",
"\n",
"Numba is a **just-in-time**, **type-specializing**, **function compiler** for accelerating **numerically-focused** Python. That's a long list, so let's break down those terms:\n",
"\n",
" * **function compiler**: Numba compiles Python functions, not entire applications, and not parts of functions. Numba does not replace your Python interpreter, but is just another Python module that can turn a function into a (usually) faster function. \n",
" * **type-specializing**: Numba speeds up your function by generating a specialized implementation for the specific data types you are using. Python functions are designed to operate on generic data types, which makes them very flexible, but also very slow. In practice, you only will call a function with a small number of argument types, so Numba will generate a fast implementation for each set of types.\n",
" * **just-in-time**: Numba translates functions when they are first called. This ensures the compiler knows what argument types you will be using. This also allows Numba to be used interactively in a Jupyter notebook just as easily as in a traditional application.\n",
" * **numerically-focused**: Currently, Numba is focused on numerical data types, like `int`, `float`, and `complex`. There is very limited string processing support, and many string use cases are not going to work well on the GPU. To get best results with Numba, you will likely be using NumPy arrays.\n",
"\n",
"## Requirements\n",
"\n",
"Numba supports a wide range of operating systems:\n",
"\n",
" * Windows 7 and later, 32 and 64-bit\n",
" * macOS 10.9 and later, 64-bit\n",
" * Linux (almost anything >= RHEL 5), 32-bit and 64-bit\n",
"\n",
"and Python versions:\n",
"\n",
" * Python 2.7, 3.3-3.6\n",
" * NumPy 1.8 and later\n",
"\n",
"and a very wide range of hardware:\n",
"\n",
"* x86, x86_64/AMD64 CPUs\n",
"* NVIDIA CUDA GPUs (Compute capability 3.0 and later, CUDA 7.5 and later)\n",
"* AMD GPUs (experimental patches)\n",
"* ARM (experimental patches)\n",
"\n",
"For this tutorial, we will be using Linux 64-bit and CUDA 8."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## First Steps\n",
"\n",
"Let's write our first Numba function and compile it for the **CPU**. The Numba compiler is typically enabled by applying a *decorator* to a Python function. Decorators are functions that transform Python functions. Here we will use the CPU compilation decorator:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"from numba import jit\n",
"import math\n",
"\n",
"@jit\n",
"def hypot(x, y):\n",
" # Implementation from https://en.wikipedia.org/wiki/Hypot\n",
" x = abs(x);\n",
" y = abs(y);\n",
" t = min(x, y);\n",
" x = max(x, y);\n",
" t = t / x;\n",
" return x * math.sqrt(1+t*t)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The above code is equivalent to writing:\n",
"``` python\n",
"def hypot(x, y):\n",
" x = abs(x);\n",
" y = abs(y);\n",
" t = min(x, y);\n",
" x = max(x, y);\n",
" t = t / x;\n",
" return x * math.sqrt(1+t*t)\n",
" \n",
"hypot = jit(hypot)\n",
"```\n",
"This means that the Numba compiler is just a function you can call whenever you want!\n",
"\n",
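"As a minimal sketch (the names `plain_add` and `fast_add` are made up for illustration), this also means you can compile a function without using the decorator syntax at all:\n",
"``` python\n",
"from numba import jit\n",
"\n",
"def plain_add(a, b):\n",
"    return a + b\n",
"\n",
"fast_add = jit(plain_add)  # jit() returns a compiled dispatcher object\n",
"fast_add(1, 2)             # first call triggers compilation for int arguments\n",
"```\n",
"\n",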
"Let's try out our hypotenuse calculation:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hypot(3.0, 4.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The first time we call `hypot`, the compiler is triggered and compiles a machine code implementation for float inputs. Numba also saves the original Python implementation of the function in the `.py_func` attribute, so we can call the original Python code to make sure we get the same answer:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"5.0"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"hypot.py_func(3.0, 4.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Benchmarking\n",
"\n",
"An important part of using Numba is measuring the performance of your new code. Let's see if we actually sped anything up. The easiest way to do this in the Jupyter notebook is to use the `%timeit` magic function. Let's first measure the speed of the original Python:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 6.56 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"1000000 loops, best of 3: 893 ns per loop\n"
]
}
],
"source": [
"%timeit hypot.py_func(3.0, 4.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `%timeit` magic runs the statement many times to get an accurate estimate of the run time. It also returns the best time by default, which is useful to reduce the probability that random background events affect your measurement. The best of 3 approach also ensures that the compilation time on the first call doesn't skew the results:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 24.98 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"10000000 loops, best of 3: 176 ns per loop\n"
]
}
],
"source": [
"%timeit hypot(3.0, 4.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Numba did a pretty good job with this function: it's about 5x faster than the pure Python version (176 ns vs. 893 ns).\n",
"\n",
"Of course, a `hypot` function is already present in Python's standard `math` module:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 72.16 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"10000000 loops, best of 3: 138 ns per loop\n"
]
}
],
"source": [
"%timeit math.hypot(3.0, 4.0)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Python's built-in is even faster than Numba! This is because Numba introduces some per-call overhead that is larger than the function call overhead of Python itself. Extremely fast functions (like the one above) will be hurt by this.\n",
"\n",
"(However, if you call one Numba function from another one, there is very little function overhead, sometimes even zero if the compiler inlines the function into the other one.)"
]
},
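{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of that last point (the function names `square` and `fast_hypot` are made up for illustration), one jitted function can call another with essentially no overhead:\n",
"``` python\n",
"from numba import jit\n",
"import math\n",
"\n",
"@jit(nopython=True)\n",
"def square(x):\n",
"    return x * x\n",
"\n",
"@jit(nopython=True)\n",
"def fast_hypot(x, y):\n",
"    # square() compiles to a direct call here, and may be inlined entirely\n",
"    return math.sqrt(square(x) + square(y))\n",
"\n",
"fast_hypot(3.0, 4.0)  # 5.0\n",
"```"
]
},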
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How does Numba work?\n",
"\n",
"The first time we called our Numba-wrapped `hypot` function, the following process was initiated:\n",
"\n",
"\n",
"\n",
"We can see the result of type inference by using the `.inspect_types()` method, which prints an annotated version of the source code:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"hypot (float64, float64)\n",
"--------------------------------------------------------------------------------\n",
"# File: <ipython-input-3-d433ca923bcf>\n",
"# --- LINE 4 --- \n",
"# label 0\n",
"# del x\n",
"# del $0.1\n",
"# del $0.3\n",
"# del y\n",
"# del $0.4\n",
"# del $0.6\n",
"# del $0.7\n",
"# del $0.10\n",
"# del y.1\n",
"# del x.1\n",
"# del $0.11\n",
"# del $0.14\n",
"# del t\n",
"# del $0.17\n",
"# del $0.19\n",
"# del t.1\n",
"# del $const0.21\n",
"# del $0.24\n",
"# del $0.25\n",
"# del $0.20\n",
"# del x.2\n",
"# del $0.26\n",
"# del $0.27\n",
"\n",
"@jit\n",
"\n",
"# --- LINE 5 --- \n",
"\n",
"def hypot(x, y):\n",
"\n",
" # --- LINE 6 --- \n",
"\n",
" # Implementation from https://en.wikipedia.org/wiki/Hypot\n",
"\n",
" # --- LINE 7 --- \n",
" # x = arg(0, name=x) :: float64\n",
" # y = arg(1, name=y) :: float64\n",
" # $0.1 = global(abs: <built-in function abs>) :: Function(<built-in function abs>)\n",
" # $0.3 = call $0.1(x) :: (float64,) -> float64\n",
" # x.1 = $0.3 :: float64\n",
"\n",
" x = abs(x);\n",
"\n",
" # --- LINE 8 --- \n",
" # $0.4 = global(abs: <built-in function abs>) :: Function(<built-in function abs>)\n",
" # $0.6 = call $0.4(y) :: (float64,) -> float64\n",
" # y.1 = $0.6 :: float64\n",
"\n",
" y = abs(y);\n",
"\n",
" # --- LINE 9 --- \n",
" # $0.7 = global(min: <built-in function min>) :: Function(<built-in function min>)\n",
" # $0.10 = call $0.7(x.1, y.1) :: (float64, float64) -> float64\n",
" # t = $0.10 :: float64\n",
"\n",
" t = min(x, y);\n",
"\n",
" # --- LINE 10 --- \n",
" # $0.11 = global(max: <built-in function max>) :: Function(<built-in function max>)\n",
" # $0.14 = call $0.11(x.1, y.1) :: (float64, float64) -> float64\n",
" # x.2 = $0.14 :: float64\n",
"\n",
" x = max(x, y);\n",
"\n",
" # --- LINE 11 --- \n",
" # $0.17 = t / x.2 :: float64\n",
" # t.1 = $0.17 :: float64\n",
"\n",
" t = t / x;\n",
"\n",
" # --- LINE 12 --- \n",
" # $0.19 = global(math: <module 'math' from '/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/math.cpython-36m-darwin.so'>) :: Module(<module 'math' from '/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/math.cpython-36m-darwin.so'>)\n",
" # $0.20 = getattr(value=$0.19, attr=sqrt) :: Function(<built-in function sqrt>)\n",
" # $const0.21 = const(int, 1) :: int64\n",
" # $0.24 = t.1 * t.1 :: float64\n",
" # $0.25 = $const0.21 + $0.24 :: float64\n",
" # $0.26 = call $0.20($0.25) :: (float64,) -> float64\n",
" # $0.27 = x.2 * $0.26 :: float64\n",
" # $0.28 = cast(value=$0.27) :: float64\n",
" # return $0.28\n",
"\n",
" return x * math.sqrt(1+t*t)\n",
"\n",
"\n",
"================================================================================\n"
]
}
],
"source": [
"hypot.inspect_types()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that Numba's type names tend to mirror the NumPy type names, so a Python `float` is a `float64` (also called \"double precision\" in other languages). Taking a look at the data types can sometimes be important in GPU code because the performance of `float32` and `float64` computations will be very different on CUDA devices. An accidental upcast can dramatically slow down a function."
]
},
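{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, here is a short sketch of how an upcast can sneak in (plain NumPy, no Numba required to see the effect): mixing a `float32` array with a `float64` array silently produces `float64` results, so a function you meant to specialize for `float32` ends up compiled for `float64`:\n",
"``` python\n",
"import numpy as np\n",
"\n",
"a64 = np.arange(10, dtype=np.float64)\n",
"a32 = a64.astype(np.float32)\n",
"\n",
"print((a32 + a32).dtype)  # float32\n",
"print((a32 + a64).dtype)  # float64: the float32 input was upcast\n",
"```"
]
},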
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## When Things Go Wrong\n",
"\n",
"Numba cannot compile all Python code. Some functions don't have a Numba translation, and some kinds of Python types can't be efficiently compiled at all (yet). For example, Numba does not support dictionaries (as of this tutorial):"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'value'"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"@jit\n",
"def cannot_compile(x):\n",
" return x['key']\n",
"\n",
"cannot_compile(dict(key='value'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wait, what happened?? By default, Numba falls back to a mode called \"object mode\", which does not do type specialization. Object mode exists to enable other Numba functionality, but in many cases you want Numba to tell you when type inference fails. You can force \"nopython mode\" (the other compilation mode) by passing arguments to the decorator:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"ename": "TypingError",
"evalue": "Failed at nopython (nopython frontend)\nInvalid usage of getitem with parameters (pyobject, const('key'))\n * parameterized\nFile \"<ipython-input-21-d3b98ca43e8a>\", line 3\n[1] During: typing of intrinsic-call at <ipython-input-21-d3b98ca43e8a> (3)\n[2] During: typing of static-get-item at <ipython-input-21-d3b98ca43e8a> (3)\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n",
"output_type": "error",
"traceback": [
"\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[0;31mTypingError\u001b[0m Traceback (most recent call last)",
"\u001b[0;32m<ipython-input-21-d3b98ca43e8a>\u001b[0m in \u001b[0;36m<module>\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'key'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m \u001b[0mcannot_compile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'value'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m 328\u001b[0m for i, err in failed_args))\n\u001b[1;32m 329\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpatch_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 330\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 331\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 332\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0minspect_llvm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msignature\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[0margtypes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtypeof_pyval\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 306\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 307\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtuple\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margtypes\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 308\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTypingError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 309\u001b[0m \u001b[0;31m# Intercept typing error that may be due to an argument\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36mcompile\u001b[0;34m(self, sig)\u001b[0m\n\u001b[1;32m 576\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 577\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_cache_misses\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0msig\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 578\u001b[0;31m \u001b[0mcres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_compiler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturn_type\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 579\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_overload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcres\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 580\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_cache\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave_overload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msig\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcres\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36mcompile\u001b[0;34m(self, args, return_type)\u001b[0m\n\u001b[1;32m 78\u001b[0m \u001b[0mimpl\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 79\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturn_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreturn_type\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 80\u001b[0;31m flags=flags, locals=self.locals)\n\u001b[0m\u001b[1;32m 81\u001b[0m \u001b[0;31m# Check typing error if object mode is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 82\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcres\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtyping_error\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mflags\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menable_pyobject\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mcompile_extra\u001b[0;34m(typingctx, targetctx, func, args, return_type, flags, locals, library)\u001b[0m\n\u001b[1;32m 702\u001b[0m pipeline = Pipeline(typingctx, targetctx, library,\n\u001b[1;32m 703\u001b[0m args, return_type, flags, locals)\n\u001b[0;32m--> 704\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mpipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile_extra\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 705\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 706\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mcompile_extra\u001b[0;34m(self, func)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlifted\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlifted_from\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 357\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_compile_bytecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 358\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mcompile_ir\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunc_ir\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlifted\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlifted_from\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36m_compile_bytecode\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 663\u001b[0m \"\"\"\n\u001b[1;32m 664\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfunc_ir\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 665\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_compile_core\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 666\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 667\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m_compile_ir\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36m_compile_core\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 650\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 651\u001b[0m \u001b[0mpm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfinalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 652\u001b[0;31m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 653\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mres\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 654\u001b[0m \u001b[0;31m# Early pipeline completion\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, status)\u001b[0m\n\u001b[1;32m 241\u001b[0m \u001b[0;31m# No more fallback pipelines?\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 242\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_final_pipeline\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 243\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mpatched_exception\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 244\u001b[0m \u001b[0;31m# Go to next fallback pipeline\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 245\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, status)\u001b[0m\n\u001b[1;32m 233\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 234\u001b[0m \u001b[0mevent\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstage_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 235\u001b[0;31m \u001b[0mstage\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 236\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0m_EarlyPipelineCompletion\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 237\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mstage_nopython_frontend\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 447\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 448\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreturn_type\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 449\u001b[0;31m self.locals)\n\u001b[0m\u001b[1;32m 450\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 451\u001b[0m with self.fallback_context('Function \"%s\" has invalid return type'\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mtype_inference_stage\u001b[0;34m(typingctx, interp, args, return_type, locals)\u001b[0m\n\u001b[1;32m 803\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 804\u001b[0m \u001b[0minfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuild_constraint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 805\u001b[0;31m \u001b[0minfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpropagate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 806\u001b[0m \u001b[0mtypemap\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrestype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcalltypes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munify\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 807\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36mpropagate\u001b[0;34m(self, raise_errors)\u001b[0m\n\u001b[1;32m 765\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 766\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mraise_errors\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 767\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 768\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 769\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36mpropagate\u001b[0;34m(self, typeinfer)\u001b[0m\n\u001b[1;32m 126\u001b[0m lineno=loc.line):\n\u001b[1;32m 127\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 128\u001b[0;31m \u001b[0mconstraint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtypeinfer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 129\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mTypingError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 130\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, typeinfer)\u001b[0m\n\u001b[1;32m 322\u001b[0m \u001b[0mtypeinfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mitemty\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mloc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 323\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfallback\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 324\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfallback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtypeinfer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 325\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 326\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mget_call_signature\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, typeinfer)\u001b[0m\n\u001b[1;32m 435\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtypeinfer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 436\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mnew_error_context\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"typing of intrinsic-call at {0}\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 437\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresolve\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtypeinfer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtypeinfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtypevars\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfnty\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 438\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 439\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36mresolve\u001b[0;34m(self, typeinfer, typevars, fnty)\u001b[0m\n\u001b[1;32m 399\u001b[0m \u001b[0mdesc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexplain_function_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfnty\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 400\u001b[0m \u001b[0mmsg\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'\\n'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mhead\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdesc\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 401\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mTypingError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mloc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 402\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 403\u001b[0m \u001b[0mtypeinfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreturn_type\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mloc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n",
"\u001b[0;31mTypingError\u001b[0m: Failed at nopython (nopython frontend)\nInvalid usage of getitem with parameters (pyobject, const('key'))\n * parameterized\nFile \"<ipython-input-21-d3b98ca43e8a>\", line 3\n[1] During: typing of intrinsic-call at <ipython-input-21-d3b98ca43e8a> (3)\n[2] During: typing of static-get-item at <ipython-input-21-d3b98ca43e8a> (3)\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of <class 'dict'>\n"
]
}
],
"source": [
"@jit(nopython=True)\n",
"def cannot_compile(x):\n",
" return x['key']\n",
"\n",
"cannot_compile(dict(key='value'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now we get an exception when Numba tries to compile the function, with an error that says:\n",
"```\n",
"- argument 0: cannot determine Numba type of <class 'dict'>\n",
"```\n",
"which is the underlying problem.\n",
"\n",
"We will see other `@jit` decorator arguments in future sections."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise\n",
"\n",
"Below is a function that loops over two input NumPy arrays and puts their sum into the output array. Modify this function to call the `hypot` function we defined above. We will learn a more efficient way to write such functions in a future section.\n",
"\n",
"(Make sure to execute all the cells in this notebook so that `hypot` is defined.)"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"@jit(nopython=True)\n",
"def ex1(x, y, out):\n",
" for i in range(x.shape[0]):\n",
" out[i] = x[i] + y[i]"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"in1: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]\n",
"in2: [ 1. 3. 5. 7. 9. 11. 13. 15. 17. 19.]\n",
"out: [ 1. 4. 7. 10. 13. 16. 19. 22. 25. 28.]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"in1 = np.arange(10, dtype=np.float64)\n",
"in2 = 2 * in1 + 1\n",
"out = np.empty_like(in1)\n",
"\n",
"print('in1:', in1)\n",
"print('in2:', in2)\n",
"\n",
"ex1(in1, in2, out)\n",
"\n",
"print('out:', out)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This test will fail until you fix the ex1 function\n",
"np.testing.assert_almost_equal(out, np.hypot(in1, in2))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: 2 - CUDA Basics.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GTC 2017 Numba Tutorial Notebook 2: CUDA Basics\n",
"\n",
"There are two basic approaches to GPU programming in Numba:\n",
"\n",
" 1. ufuncs/gufuncs (subject of this section)\n",
" 2. CUDA Python kernels (subject of next section)\n",
" \n",
"We will not go into the CUDA hardware too much in this tutorial, but the most important thing to remember is that the hardware is designed for *data parallelism*. Maximum throughput is achieved when you are computing the same operations on many different elements at once. \n",
"\n",
"Universal functions are naturally data parallel, so we will begin with them."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Universal Functions\n",
"\n",
"NumPy has the concept of universal functions (\"ufuncs\"), which are functions that can take NumPy arrays of varying dimensions (or scalars) and operate on them element-by-element.\n",
"\n",
"It is probably easiest to show what happens by example. We'll use the NumPy `add` ufunc to demonstrate what happens:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([11, 22, 33, 44])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import numpy as np\n",
"\n",
"a = np.array([1, 2, 3, 4])\n",
"b = np.array([10, 20, 30, 40])\n",
"\n",
"np.add(a, b)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ufuncs also can combine scalars with arrays:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([101, 102, 103, 104])"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.add(a, 100)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Arrays of different, but compatible dimensions can also be combined. The lower dimensional array will be replicated to match the dimensionality of the higher dimensional array"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"c: [[ 0 1 2 3]\n",
" [ 4 5 6 7]\n",
" [ 8 9 10 11]\n",
" [12 13 14 15]]\n"
]
},
{
"data": {
"text/plain": [
"array([[10, 21, 32, 43],\n",
" [14, 25, 36, 47],\n",
" [18, 29, 40, 51],\n",
" [22, 33, 44, 55]])"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"c = np.arange(4*4).reshape((4,4))\n",
"print('c:', c)\n",
"\n",
"np.add(b, c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the above situation, the `b` array is added to each row of `c`. If we want to add `b` to each column, we need to transpose it. There are several ways to do this, but one way is to insert a new axis using `np.newaxis`:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[10],\n",
" [20],\n",
" [30],\n",
" [40]])"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"b_col = b[:, np.newaxis]\n",
"b_col"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[10, 11, 12, 13],\n",
" [24, 25, 26, 27],\n",
" [38, 39, 40, 41],\n",
" [52, 53, 54, 55]])"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"np.add(b_col, c)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The NumPy documentation has a much more extensive discussion of ufuncs:\n",
"\n",
"https://docs.scipy.org/doc/numpy/reference/ufuncs.html"
]
},
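{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside (this cell is not part of the original notebook), Numba can also build compiled ufuncs for the CPU with the same `@vectorize` decorator. With no explicit signature it compiles lazily for whatever argument types it sees, which makes a handy point of comparison before we target the GPU. The name `cpu_add` is just for illustration:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from numba import vectorize\n",
"\n",
"@vectorize  # CPU target by default; signatures are inferred on first call\n",
"def cpu_add(x, y):\n",
"    return x + y\n",
"\n",
"cpu_add(a, b)  # follows the same broadcasting rules as np.add"
]
},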
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making ufuncs for the GPU\n",
"\n",
"Numba has the ability to create compiled ufuncs. You implement a scalar function of all the inputs, and Numba will figure out the broadcast rules for you. Generating a ufunc that uses CUDA requires giving an explicit type signature and setting the `target` attribute:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from numba import vectorize\n",
"\n",
"@vectorize(['int64(int64, int64)'], target='cuda')\n",
"def add_ufunc(x, y):\n",
" return x + y"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"a+b:\n",
" [11 22 33 44]\n",
"\n",
"b_col + c:\n",
" [[10 11 12 13]\n",
" [24 25 26 27]\n",
" [38 39 40 41]\n",
" [52 53 54 55]]\n"
]
}
],
"source": [
"print('a+b:\\n', add_ufunc(a, b))\n",
"print()\n",
"print('b_col + c:\\n', add_ufunc(b_col, c))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A lot of things just happened! Numba automatically:\n",
"\n",
" * Compiled a CUDA kernel to execute the ufunc operation in parallel over all the input elements.\n",
" * Allocated GPU memory for the inputs and the output.\n",
" * Copied the input data to the GPU.\n",
" * Executed the CUDA kernel with the correct kernel dimensions given the input sizes.\n",
" * Copied the result back from the GPU to the CPU.\n",
" * Returned the result as a NumPy array on the host.\n",
"\n",
"This is very convenient for testing, but copying data back and forth between the CPU and GPU can be slow and hurt performance. In the next tutorial notebook, you'll learn about device management and memory allocation.\n",
"\n",
"You might be wondering how fast our simple example is on the GPU? Let's see:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 15.95 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"1000000 loops, best of 3: 1.29 µs per loop\n"
]
}
],
"source": [
"%timeit np.add(b_col, c) # NumPy on CPU"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 691 µs per loop\n"
]
}
],
"source": [
"%timeit add_ufunc(b_col, c) # Numba on GPU"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Wow, the GPU is *a lot slower* than the CPU?? This is to be expected because we have (deliberately) misused the GPU in several ways in this example:\n",
"\n",
" * **Our inputs are too small**: the GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 and 16 integers, respectively. We need a much larger array to even keep the GPU busy.\n",
" * **Our calculation is too simple**: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations per byte of data (a ratio often called \"arithmetic intensity\"), then the GPU will spend most of its time waiting for data to move around.\n",
" * **We copy the data to and from the GPU**: While including the copy time can be realistic for a single function, often we want to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.\n",
" * **Our data types are larger than necessary**: Our example uses `int64` when we probably don't need it. Scalar code using 32- and 64-bit data types runs at basically the same speed on the CPU, but 64-bit data types have a significant performance cost on the GPU. Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than 32-bit floats. NumPy defaults to 64-bit data types when creating arrays, so it is important to pass the `dtype` argument or use the `ndarray.astype()` method to pick 32-bit types when you need them.\n",
" \n",
" \n",
"Given the above, let's try an example that is faster on the GPU:"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import math # Note that for the CUDA target, we need to use the scalar functions from the math module, not NumPy\n",
"\n",
"SQRT_2PI = np.float32((2*math.pi)**0.5) # Precompute this constant as a float32. Numba will inline it at compile time.\n",
"\n",
"@vectorize(['float32(float32, float32, float32)'], target='cuda')\n",
"def gaussian_pdf(x, mean, sigma):\n",
" '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''\n",
" return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 0.38114083], dtype=float32)"
]
},
"execution_count": 47,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Evaluate the Gaussian a million times!\n",
"x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)\n",
"mean = np.float32(0.0)\n",
"sigma = np.float32(1.0)\n",
"\n",
"# Quick test\n",
"gaussian_pdf(x[0], 0.0, 1.0)"
]
},
{
"cell_type": "code",
"execution_count": 48,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"10 loops, best of 3: 51.7 ms per loop\n"
]
}
],
"source": [
"import scipy.stats # for definition of gaussian distribution\n",
"norm_pdf = scipy.stats.norm\n",
"%timeit norm_pdf.pdf(x, loc=mean, scale=sigma)"
]
},
{
"cell_type": "code",
"execution_count": 49,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 15.15 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"10 loops, best of 3: 7.69 ms per loop\n"
]
}
],
"source": [
"%timeit gaussian_pdf(x, mean, sigma)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's a pretty large improvement, even including the overhead of copying all the data to and from the GPU. Ufuncs that use special functions (`exp`, `sin`, `cos`, etc) on large data sets run especially well on the GPU."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## CUDA Device Functions\n",
"\n",
"Ufuncs are great, but you should not have to cram all of your logic into a single function body. You can also create normal functions that are only called from other functions running on the GPU. (These are similar to CUDA C functions defined with `__device__`.)\n",
"\n",
"Device functions are created with the `numba.cuda.jit` decorator:"
]
},
{
"cell_type": "code",
"execution_count": 54,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from numba import cuda\n",
"\n",
"@cuda.jit(device=True)\n",
"def polar_to_cartesian(rho, theta):\n",
" x = rho * math.cos(theta)\n",
" y = rho * math.sin(theta)\n",
" return x, y # This is Python, so let's return a tuple\n",
"\n",
"@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')\n",
"def polar_distance(rho1, theta1, rho2, theta2):\n",
" x1, y1 = polar_to_cartesian(rho1, theta1)\n",
" x2, y2 = polar_to_cartesian(rho2, theta2)\n",
" \n",
" return ((x1 - x2)**2 + (y1 - y2)**2)**0.5"
]
},
{
"cell_type": "code",
"execution_count": 55,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"n = 1000000\n",
"rho1 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)\n",
"theta1 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)\n",
"rho2 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)\n",
"theta2 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)"
]
},
{
"cell_type": "code",
"execution_count": 58,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 2.30478978, 0.23699605, 1.02151287, ..., 1.00132406,\n",
" 0.45568103, 0.3965269 ], dtype=float32)"
]
},
"execution_count": 58,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"polar_distance(rho1, theta1, rho2, theta2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note that the CUDA compiler aggressively inlines device functions, so there is generally no overhead for function calls. Similarly, the \"tuple\" returned by `polar_to_cartesian` is not actually created as a Python object, but represented temporarily as a struct, which is then optimized away by the compiler."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Allowed Python on the GPU\n",
"\n",
"Compared to Numba on the CPU (which is already limited), Numba on the GPU has more limitations. Supported Python includes:\n",
"\n",
"* `if`/`elif`/`else`\n",
"* `while` and `for` loops\n",
"* Basic math operators\n",
"* Selected functions from the `math` and `cmath` modules\n",
"* Tuples\n",
"\n",
"See [the Numba manual](http://numba.pydata.org/numba-doc/latest/cuda/cudapysupported.html) for more details."
]
},
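{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration (this cell is not from the original notebook; `poly_clamp` is a hypothetical name), the constructs above can be combined freely inside a CUDA ufunc:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@vectorize(['float32(float32)'], target='cuda')\n",
"def poly_clamp(x):\n",
"    acc = x\n",
"    for i in range(3):     # for loops are supported\n",
"        acc = acc * x + 1.0\n",
"    if acc < 0.0:          # conditionals are supported\n",
"        acc = 0.0\n",
"    return math.sqrt(acc)  # selected math functions are supported\n",
"\n",
"poly_clamp(np.float32(0.5))"
]
},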
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise\n",
"\n",
"Let's build a \"zero suppression\" function. A common operation when working with waveforms is to force all samples values below a certain absolute magnitude to be zero, as a way to eliminate low amplitude noise. Let's make some sample data:"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<matplotlib.lines.Line2D at 0x11d2be3c8>]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXsAAAD8CAYAAACW/ATfAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8FdXZB/DfQxYIIWENEAIkLGEVAUkRlCKbbFqxrVro\nq2KrVatdbKsFtGpb17q0r0tt1bpVraKI1RcVZHFBVCCRLQlLIgmQELIBgQSyn/ePTO6dm9ybu83M\nuXPm+X4++TB37tw7z1xunsycOec5JIQAY4wxtXWSHQBjjDHzcbJnjDEH4GTPGGMOwMmeMcYcgJM9\nY4w5ACd7xhhzAE72jDHmAJzsGWPMATjZM8aYA0TLDgAA+vTpI9LS0mSHwRhjtpKVlVUhhEgKZNuI\nSPZpaWnIzMyUHQZjjNkKER0KdFtuxmGMMQfgZM8YYw7AyZ4xxhyAkz1jjDkAJ3vGGHMATvaMMeYA\nnOwZY8wBONl7UXTiDD7dXyY7DMZcsoursOvISdlhKEEIgbczj6CusUl2KJbiZN/G61sPYdpfPsF1\nL22XHQpjLpc+9QUW/X2L7DCUsC6nFHes2o3/eX6r7FAsxcle52x9E+56N1t2GIz5lLb8A+QcrZId\nhq2dqm0AAGQeOoFtBcclR2MdTvY6m/PKZYeglLTlH7h+nHbJbKaf8FVnyJ7YkIffr9rtenzVs19J\njMZanOx1Hvxwr8fj4pNnJUVif8t0v1AAUFldLykSezt5ph5pyz/wWFd2uk5SNPb3tw0HZIcgDSd7\nncLKMx6PL3x4ExqbmiVFY28rM494PL7g4U2obeCz+2AdPVnrdX1rUwQL37Eq75+xajjZa97bWex1\n/b3v51gcif35Sup3rt5jcST2tzbnmNf1/93h/fvKgld2mpO9Y5yubcCv39zp9bnXtx62OBr7G3X3\nWq/rV3OCCtqTG/O8rr/nPT4JCZavq/TLnnZGLydO9gA+yvZ+9sSCV1HN7cksMj37+UGfz9XUNVoY\niRyc7AGPu/MsPKu/KerweSGERZEw5unRdft9PvfYx76fUwUn+wAUnTjjfyMGAMgrre7w+fvW7O3w\neRa4VVkd/2FlbtsLO+5P/9KWQmsCkYiTfQCyDp2QHYJtvO0nAb24pcCiSOzvHT+f5e1v77IoEvvj\nq3dO9gHxdfOWMTP9jpO5YQoqamSHIB0ne2YY7kdvHP4srXeiRu2Bf45P9tUOuAtvldXfcNdKo7zj\n50Z3q4/2lJgciXNc/YLahdH8JnsiGkREnxBRLhHlENGvtfW9iGg9EeVp//bUvWYFEeUT0X4immfm\nAYQr08+Nm1ZNzdyLxJ9N+7gstFEC7bT089e/MTcQB8k5ekp2CKYK5My+EcDvhBBjAEwBcCsRjQGw\nHMBGIUQ6gI3aY2jPLQYwFsB8AM8QUZQZwRsh0FLGw+780ORI7G/D3tKAtttbovYvlRG4Gcc4Tqps\n2RG/yV4IUSKE+EZbPg1gL4AUAIsAvKJt9gqAy7XlRQDeFELUCSEKAOQDmGx04My+Fjyxmfvb+3H/\nB9xF1ShOqmzZkaDa7IkoDcBEAFsB9BNCtDYYHgPQT1tOAaCvglWkrWv7XjcSUSYRZZaXyykt3Oyj\naeZ3F4+wOBJ1TU7r5XX9O9y+zyT6/I6ZXtdvPVhpcSTWCTjZE1E3AO8AuE0I4XEdLlpO04I6VRNC\nPCeEyBBCZCQlJQXzUsO8uf2I1/W/nJ2Or1fMtjgaNb1181SsvHFKu/VcViF4REBa767t1vs6aWG+\nDe7dFfddfk679VVn1a0mGlCyJ6IYtCT614UQq7XVpUSUrD2fDKD17lwxgEG6lw/U1kWcgor2oz3v\nXDgKABDVido9d7ae21GDceWkgQCAmOj2XzNuxQnOTy5Mw8EHF2L+OcntnlsVYM8d1uKzO2YAcH8/\n9VT+uxlIbxwC8AKAvUKIv+qeeh/AUm15KYD3
dOsXE1FnIhoCIB3ANuNCNs7zm9uP5rxx+jAAQFJC\n53bPPbXJewVC5v1M/Y+XjQUATBzUo91zj67bZ3pMdlXf2L46473fG4uWX8X2eHSob97uDaX2jgcA\ndImJwtCkeI/n1ucG1snAjgI5s78QwDUAZhHRTu1nIYCHAVxMRHkA5miPIYTIAfAWgFwAawHcKoRQ\n4pT4mU+/lR1CxPrb+vYzAMV3jgYAEBF23H2xx3Mqn0GF60y977Efo5MTLIzE/h5e63lSkd63m8fj\nt26a6vE40PENdhTtbwMhxBcAvJ9SAF4btoUQDwB4IIy4pBiX0l12CLbVtu7/nj/O9XjcPS7GynBs\nraP+3osmpKBH11gsfTEiL5YjzjtZni3Ib7a5f5TYxTnfS8ePoNXrl+jZdPPE4gmSIrG/hDa/RJ28\n3ANh3v3PvzxHct5z6RiPxxeNkNOhwY7aTt/Yu5vn73isl/tJqnLOkbbhrQfD7fNGejxeNKFdj1HG\nLPfTaUNkh2Bb3u5/OJVjk33Zac8bij+YmIJR/RMlRcNYcNqekR45znMusI45Ntm3bVUY2d/7ja+U\nHnEWRKOWF5ZmeF1/+YQBFkeirivadBtc52NicuZfcvcuHo8XPf2FpEjM5dhk39imGWfsAO83Z6+7\nIM2CaOytbZPYuIHeP8vHr/K8B8Jno+0FOivanNF927zurBnhKGXe2H5e16//7UUej3cVVVkRjuUc\nm+xvW+k5Icm09D5etzt/qPfh/sztdK1nV8G+CV28btd2oFpe2WnTYrKr0lOezYt7/zzf63azRnkm\nrl1FJ02LSRUj+3m/eu/W2W+nRCU4NtkHWgnvnDZn/FzAq73xf/44pNfVNvDNs7bazq8QFxtYwdgd\nhznZ++NtoKSTODbZB6rtoMXdil7iyfDs5wdlhxBxuP+8cfJKPa8cLxvvu3fdKB/37FTCyR7AL2cN\n9/lc2yHqfCPMOLuO8NloOBaO6y87hIh206tZHo+7d/U9gGpQL88Ccw1N6l11crIH8L3xgfcS4ZIJ\nHfv5jGEdPj92AHdvNcqKBaNlhxDRDgYxyfhtc9I9HlfXqjddqSOTfWWbol0jfNy4Yf6dbjNCcfao\nvj62bHHJue2rNjLvYqI6HnXc9myUhc5XbzyVODLZnwryr3ZcTMTOqihdsGWfuStr4OaODa6ZpvRU\nrUmROI+K3TAcmewbg2yPG9LHswwq98jRCbLkTddYZ3RzM8K5QRbm27iXJ3z35Vcd3Jfz5ov8CpMi\nkceRyf7JTflBbd+2IFr5aZ5lqRW1yfbjvdSu70gT1zr2afHkwUFtz5OU+/ar2en+N9Jv/8YOkyKR\nx5HJfn1ucD1q0tu06XN6cms7jVtMVHBfKZWngQtXsGWh+XvpW3SQ30sVOfITCHcwz9GTPDS9VUEQ\nPR684a6sbsE2L7b1+YFygyJxpljF/yCofXQB+OtV44N+zWGu6eLys39nBv2aacPdpSlWrN5jZDi2\n9tNXgv8sr8pwF0T7jJO9Syj31b5cMcuESCKH45N9KF0BeeJx7/7+4/MC2u4cnhHMq1DOzO/53lgT\nIrG/tjOnBaJPN7XLKTg+2XeODr5b5XI+G/Xq4jHeqwq2de3UVJMjsb8Xr/NeJrotpxTxCtZDH+6V\nHULEcXyyZ8bxNwio1QCeI8CvSYO52mo4avjqux1O9gHStzMz79rWEWKh66iOS0e8TbfpdDdf1HEJ\nD6fgZB+gB78/TnYIjPl1jEfRtjMsKd7/Rl6E2zsq0jgu2a/PLQ3pdZ1jHPdRMRv6YHeJ7BCkO1Tp\n2R34ohFJAb/2u7pJjAor1ep157gMpu8q2Ds+NuDXBTvAxQmOVRlzFsnlJ4wT6LSGKjt5xnOgXt9E\n7zOneXPj9KGu5WLFxtM4Ltnr+Zpk3JsuXAytnXd3FBvyPntLeHpCozRwmz06hXHvKF7Xu0m1iWQc\nnexvnRlc
cSQ9rkMC/GXtPtdyOFc+73xTZEQ4tnaqNvSyEcsXjHItf7afB1aF009gwsDgajvZiaOT\nfTh9lFW7xAvXU0smBrV9f92lNRdDA/aFcXWzRFcwjb+X4enUSd0eZY5O9uFMQLy7iKfU09Pf2ArE\nv6+f7FpubFar10MoToVREI7vJ3mqqVNvlikjODrZBzu45/M7ZrqWf7Nyl9Hh2Fqwfez1s4O99nXw\nQ9tVc4Ou48B/fna+xEjs7w//zZYdQkRydLIPVlSAI0QZC8d5g3vKDsHW8sqqXctXTwluTgCVcbIP\nQv8gunAxxuS7ftpQ/xs5BCf7IEQpfPOGRQ4edmCctN7hTcpe36jO/SRHJXsevMPsQPCcU4YJt15T\nOF1iI43fZE9ELxJRGRFl69b9kYiKiWin9rNQ99wKIsonov1ENM+swEPBPfyYHYR7TpJfxoPUwnHX\nwtGu5WaFThADObN/GcB8L+v/JoSYoP18CABENAbAYgBjtdc8Q0QRM/T0P1sPuZYfueJciZEw5lu4\n6aXEoDIWTjV7dF/X8uYDFRIjMZbfZC+E+BzA8QDfbxGAN4UQdUKIAgD5ACb7eY1l7n4vx7U8OY3r\nhYfjeE29oe/HNV3CEx/rPqc6pFgBL6vF6Oai/ThXnTmSw2mz/yUR7daaeVr7iqUAOKLbpkhbF3H4\nZmt4Tp5xJ/tbZ4ZfL3zJ81+H/R521fYmoD5xB+pe3fSEG/eGVtmVtUjRjb9Zl6POZxlqsv8HgKEA\nJgAoAfB4sG9ARDcSUSYRZZaXW1/Po2sIv1BtObk+jn5Y/o/PD22aQf2kEkeOO3eYf0GFZ0neUG4q\npvR0J6g9xafCjsmujCi9oWrJhJCSvRCiVAjRJIRoBvA83E01xQAG6TYdqK3z9h7PCSEyhBAZSUmB\n15s2Sm8DJhf+KNu5tcOvecFdETChS2g1hq7MGGhUOLamL70xZ3Rg8/i2daFuJrWK6rqwY7IrlbpK\nGi2kZE9EybqH3wfQ2lPnfQCLiagzEQ0BkA5AqTqhc3Q3b77Iq5QYSeRI7BJabZYUnosWAHDHqt2u\n5Vmj+nawJfOHu636FkjXyzcAfAVgJBEVEdH1AB4hoj1EtBvATAC/AQAhRA6AtwDkAlgL4FYhhFJt\nHY9cMd61rFIfXBYZlkwe5H8j5lOjrhnnoR/wVKJ6fq+/hRBLvKx+oYPtHwDwQDhBRbJEXZNFqFMc\nMuYLT9oenhXv7HEtzx/b35D3FEIo8f/iqBG0RoiO4o/MKJ2j+bNkxvpgj/s+Wo+uxpR+VmW+Bf5t\nY9KocLbEIpdR3y9VvqeOTPZGdLtkjDlD6Sk1RiQ7Jtnru2T14Jl9wnKYR2hGPC76Zxx9fXw7c0yy\n/2R/mWtZ1UETVqkKYwo9Zp4+urEjnx3gicfD8evZ6a7lpS+q0XvcMcl+g67nzN9+NEFiJPZnZBPm\n7XNHuJbP1PPcoeG4ZYZ7RPLGvWUdbMn8UXHAn2OS/UHdkPThSd0Me18njtjrZGC2H9w73rWsSKcH\naXp3i3Utv511pIMtmT8qDvhzTLLPOnTCtRxt4FyylTXOG5reSfeteekn3wnrvXrHuxOU0ytfzhgZ\nXtmQy8YPcC3XNjjvJMRIqvTA0XNMstfr1jm0Wi6tZuuGtK/c7rwzqBc2F7iWZ44Mb3j/uQO7u5b3\nlji3gBcAxIf5vVQxQTHjODLZhytO13Xzq2+dVx/n7awiw95Ln6Be2lJo2PvaxatfFbqWozhZMxNx\nsg/B9BHuy+2tBYHO68K8iYtx/+HcXVQlMRI59BPq/HTaEImR2J/+Bv+EQT0kRhKZHJnsw73c1beN\nsvDwJDJuo5MTZIdga/oeSM9dO0liJJHJkck+XF1ieAQuMx6B//CF45dv7HAt903oIjGSyMTJnrEI\nEcuF4ZiJ+NvFGGMO4Lhkf9P0obJDYIwxyzki2euLQl16Lt9cZYwF53hNve
wQwuaIZH9IV6WRuzKH\np65RqVkmGfNJP/n795/ZIjESYzgi2e8qOula5mQf2Xhe3/A8dqV7juTGJi6ZEI4Hv3+Oa7mmzv4n\nOY5I9q9+dUh2CCxATqqVX3XG+D9sCbo5kvUnOSx4fRPd3Tcrqu1fA8sRyb5BV04xKaFzB1syf/Rt\nl3fMGykxEvurMKGInr7u0z260bmMOSLZ7zriPsPhwRbh0V8l3TpzuCHvqS8n+/VB59Qa0k8m9eSS\niYa8Z2IX9yxsOUedU1iOZ+byzxHJ3gz3XDrGtazKHJWBeObTbw1/z2umprqWX/6y0PD3j1T6Kp9z\nRodXPdTpDpSqMXWgmTjZh2jykF6u5eo6nmEpHEunprmWi06clReIxfRXnF1jwytv3GrMgERD3sdu\nGvhmtF+c7EOkv2p8aUuB7w2ZX/qS0U7yry+M/944tbBcje6ESz8XL3PjZB8iAXe2f+3rwxIjYYzp\nZ5/7esUsiZFELk72IerfnW/0MhYp9Cdc0VHmpDW7zzfNyT5E3KuHscjx7o5i0/fR1GzvHj+c7Blj\nLAD6pls74mTPmMK4/7lx7P5RcrJnjDEfxqV0dy3bPNern+wPVda4lh/6wTiJkTBmjekjklzLdj8b\nle1H3xnkWt552N61hpRP9keOuwfpXD4hRWIk9qcvUmb0pOv6kgksPD/77hDZIShjyeTBruX9pacl\nRhI+5ZN9Y7O7u5Tdb7DIpi8/fLeuXIQRli8YZej7OVmv+FjXcj2PLA2LfpDaYV0rgR35TfZE9CIR\nlRFRtm5dLyJaT0R52r89dc+tIKJ8ItpPRPPMCjxQK7cfkR2CMvRdz3p2jelgy+ANTYo39P2cLL1v\ngmt5VVaRxEisN6JfN9Pe+xWbl0oP5Mz+ZQDz26xbDmCjECIdwEbtMYhoDIDFAMZqr3mGiKSOhf8o\n+5hr2aj6I61unTnM0PeLdPrCXUYPXOkS48ySCQBwVcZAQ98vNtr9f7Ot4Lih7x3p+B6Fb35/Y4UQ\nnwNo+41ZBOAVbfkVAJfr1r8phKgTQhQAyAcw2aBYI05yd2e1M28rNC9xdNJNIab/o6IqfZfIv/zw\nXNP280V+hWnvHYk41/sW6ulZPyFEibZ8DEDrZI0pAPTtJkXaunaI6EYiyiSizPLy8hDDkEvf9FBS\npX61xtXfmDdKsUecu1lo074y0/YTKQ4f18+LbF7xMhUmyvan6qz7XpIzy8AFJuxrcdFyihL0H1Qh\nxHNCiAwhREZSUpL/F0QgfdvoxzmlEiOxv566m4of56r/WR496Zw5EMymr3ip7z3DPIWa7EuJKBkA\ntH9bT8WKAQzSbTdQW6ek7rqz0Xvf5yngjKKv866qf35m/CQwTqVP9j8+n5O9L6Em+/cBLNWWlwJ4\nT7d+MRF1JqIhANIBbAsvxMilvxHGWDA+O2DPpstIpG/2c/KNfn/8dk8hojcAzADQh4iKANwL4GEA\nbxHR9QAOAbgKAIQQOUT0FoBcAI0AbhVCNJkUO2OM4aGP9skOwRb8JnshxBIfT832sf0DAB4IJyjG\nGGPGckw7hFOna2OMheeJxRNkh2AIxyT7OG7LY4yFQN811s4lox2T7BdNMLZwF2ORLLGLsaPFnSxK\nl+zzyqolRhIepZN9bYP73nC/RJ5G0Cg3TOOqipFuGReWM8zcsf1cy3aeh1bpZF+oq1Kn7xPPwtO7\nW2fZITA/esfz/5FRYnR1oJq5GScy/XfHUdcy10s3TrRJN7vPH9LLlPd1onm6s9ETDiiZYJWvD1bK\nDiFkSif7dTnuipdTh/WWGIn96ZvEEuPMaQ+eMbKva9nON8L8OVvv/iwH9jTnJER/U3GnA0YkW+Wx\ndQdkhxAypZO9vr5UfGe+YRWOugZ3W+WFw/uYsg9926i+NLVqDuhmPPrDJaNN39+d7+4xfR9OYefJ\nYJRO9lYY2scZk258nOtOvmY1iQ1Lck
88ceKMuk0P+lou+vlizVJSxUXXmOLJ/mC5+dOI3TzDPYFJ\no43/6vtzx6rdrmUzS/K2qm1Q97PMPHTCtWz0hDpO1plrVXWIP50w6SfeztL9ErPw3LcmV3YIptno\ngHr9Mswd2192CBGNk32Y9FX2bl+1S2IkzC4qTtfJDkFJZvUSUwUnewMdOa7+bFUsfMUn+XtilMOV\n7hm/ppnUcUAVnOwZY7ZVWeO+Srrk3GSJkUQ+TvaMMdtaqxtLwxOXdIyTPWPMtrKLqyzZz5jkREv2\nYyZO9owpakS/bv43srkt+daUL+ibaP9aQ5zsGVNU52hu1jDK7FF9/W8U4ZRN9vraKhMH95AYCWNy\n2LlCY6SZp+vDb9fxNMom+2bd93x6uvlD0p1izmjrznBULoZmhZH9EmSHoIy+uvkwKqvtOU5C2WR/\nsNw9o8wN3+XJNoxiRamEVifONFi2LxVNSuvpWi49xfVxjHK6ttH/RhFI2WRfrhulmNDF3IlLVK+V\nry/c1Ts+1tR9LZ2a6lpWsRlCXz9pxkhzrzgvHecu5bElv8LUfTnJtoLjskMIibLJ/p1vii3b14M/\nGGfZvmTQj/i8ekpqB1uGL6qT+ytp5yngfKnTHVO8yUXQ9PMONKv3d1OalZlHZIcQEoWTfZFl+xrS\nW+0yx7lHT7mWE02+Sho/qLtr+axuwhRVROnqt5jdIqZvcjtTb8+mB2YcZZO9lRqa1TsD1Xv6k3zX\ncp8Ec5tx9L0eMgvtebkcie55L0d2CKYalqT2CZcRONkboLHJfY2sYg+S/DL3zW6z66/rh7wXVJzp\nYEt7OqQr3HX5hBSJkajltjkjZIcQ8TjZG2B4X/dIxaITXNHQKP/87FvZIRjuyHF3sk9zyCxnVlhw\nDtey94eTvQGsbIdl9qZvEuOmB+NER3Eq84c/IYM9sSFPdggsgu08ctK1bOWYBRVV1/FN52BwsjfY\n21nW9QJizMl2HLZn2QJZONkzxmyppIpHBQeDkz1jzJb0N7uZf5zsGWO29NSmfP8bMZewkj0RFRLR\nHiLaSUSZ2rpeRLSeiPK0f3v6ex/GGIt0f7hktOwQwmLEmf1MIcQEIUSG9ng5gI1CiHQAG7XH0qxY\nMErm7hljiviurlT62Xr7lfIwoxlnEYBXtOVXAFxuwj4CduP0oTJ3z5hUP58xTHYIyuib4J6acP3e\nUomRhCbcZC8AbCCiLCK6UVvXTwhRoi0fA9AvzH0ErVlX4o/7Mhsnlgeu2M4VkwbKDkEZPXXlve14\nczjcQifThBDFRNQXwHoi2qd/UgghiMhrsRjtj8ONADB48OAww/D0dYE1kxA7zaNXnis7BBakYUnu\nUh61DU0etYdY6JpsWDM6rFM1IUSx9m8ZgHcBTAZQSkTJAKD9W+bjtc8JITKEEBlJScZO4qB4EUpp\nRvVPtGQ/d186xpL9yCRjysDPD5Rbvk9V6SehsYuQkz0RxRNRQusygLkAsgG8D2CpttlSAO+FG2Tw\nsVm9RyChi7nVIGU5XeueGnBEv24dbGmcn16YZsl+ZLptTrrl+1RxfgBZ1uwp8b9RhAnnzL4fgC+I\naBeAbQA+EEKsBfAwgIuJKA/AHO2xpbZKmDbs9rkjLd+nFcp00ztadf9Dvx8VS0YDwIXpfSzf5zEe\ncWqYg+U1skMIWsino0KIgwDGe1lfCWB2OEGFa/+xU/43MpiqZ/ayc+2e4iqcO7CH3CBMECeh7fzl\nLwtx00XcO8eplOxeIaNmxjQJZ2pWkH1mbcczqEDESOjZpFItmfyy067l76TxuM1AKJnsdxdVWb7P\nvgldLN+nFWQ3oqxYvUdyBMY5pbv/wcJT2+C+QdovUc3fPaMpmexls2MfXF/+8anc2aJUuqn42X7u\nDWOU4pPuGeEev6pdazLzgpO9CdblHJMdgmHe3VEsOwRl7JNwL0lV23SdMDpH89iBQHCyN8FrXx+S\nHQ
KLQH//RL05dZl9cLI3QWGlOs04jEWi+kb7DWqSjZM9Y8x2XuWr56Apneznj+0vOwTGmEKumZIq\nO4SQKZ3so6O44iVjzDiDe3WVHULI1E72nTjZM8aMM3NUX9khhEzpZD99hLHVNBmzoyu5pr1h+nRz\n17Sva7TXGBClk/2Fw9UsYcDsLbW3tU0Bs3Rno3acTi+S9OjqTvaNTbLHlwdHuWSvn1SgR9cYiZGo\n5a2bplq6v+suSHMty67PY7Q5o62dvE1frFSlEcmyNdvse6lcst9b4h6lGNNJucOTZnSytZNtDOjh\nrnfSaMNZgToSZfG9pJQe7iuJ3UUnLd23ygoq7FWkT7lsWHrKXdmPp581TrTFfzgTu7ivyux2BuWP\n1YW70nWTzmwvtH6uB1Vd9vQW2SEERblk/4v/7HAt82TjxomLtbb+iP5+i2K5HlecZ+0NU/28s/+3\ny34zLDFjKJfsZbZJvvGzKdL2bYZmic0n3buqe2Yvs3XRjnOnMmMol+xlmjqst2tZhV+q7z7yibR9\nezbjSAvDFFY3iek1KfCHU7Ub9lbhZG8SFX6p9DXDZapVoAeJ/o+/zDP7E2fsP4FK0Qn39/KuhaMl\nRmIvnOxNknuUa5cbZVVWkewQwrbziLsXjMz66ypUi9Q3610z1b61aqzGyd4ksmd4UokKyf7JTfmy\nQ1DG53kVrmUZc/naFX9SJqlXoM0+UuSXVcsOIWxb8iv8b8QCov/jb/WYBTvjZG+ST3m+UabT1eKu\nq4y1xcmeMWYrdZJv2MfatOnInlEzZjf275wVMfYdOy11/3atwmLTsBmzl9N1jbJDYAbpb3G5C6Mo\nm+x/aPGQdMaYMzzw/XGyQwiJssmeb4jZX0KXaFPe98tvK3Cq1v6Di4Lxq1nDw3p9RXUdsg4dhxAC\nG3JLPUqJ+3tdpmLF1y7QjZTfuLdUYiTBUTbZR1qXrLJTtfjP1sMhvfajPSU4UOq7nXJtdgn2B9CO\n+d7OYhRU1OClLQWoCmIkZaJJSdeflB5xhr9n1dkG/Pj5rbj51Sy/25afrsPv3tqFzXn271l1+cQU\nr+v3lpzCxznHfL7u9a2HsD63FBn3b8AP//EV/vR/ubjh35lY/NxXAe034/4NuOKfgW1rR9e/kolP\n9pXJDiMgcn6LTVBZXYfH1x+QHUY7z3yaj20Fx11dMRuamrFUm5jjUGUNXtpSCAC4ftoQDNJNZrzv\n2Cm8v/ModhdV4Qutj3bhw5dgfW4pKqrrsGTyYNe2N7/2jet5vVVZRegaG4WF45KxKqsIt7+9y/Xc\n1wcrMaBkRPgdAAAN2klEQVRHHH5ywRAcO1WLe9/Pwdwx/XCmvhF3LhyN/92Q59r2T4vGGveBBCE2\n2n0uIoTwqGJaWFGDG/6diWumpLo+z1aNTc0YftdHAICbLhqKH0wciJH9W+rxN2jjH/YdO43nPz+I\nbl2isTb7GEb2T8CKBaM89nHr699gW+FxvPNNEQofvgT/2nwQYwYkAgB+/PxW7LznYnywpwR/+G82\n8h9Y6PUEI+vQCXz1rfw+9vrvlhACd76bjQOlp5F16AQAIOdP83D/B7m4c+FoJGh1iQ5V1uCud7M9\n3uflLwsBANsLT4QVz9Ob8jB1WG9MSu3l9fn3dhbj12/uxL775uPj3FKsyzmGC4f1wY/PH+x1eyu1\nraa7Oa8CmYeO4/a5I9s9V9fYhJF/WIt/Xj0J88/pb2WY7VAkFBXKyMgQmZmZIb32ltezkJHaC9lH\nq7D6m2LX+uevzcDFY6ydEQgA0pZ/4Fq+59Ix+POa3HbbtCblBU9s9phsBQD+etV4/PatXe1e480t\nM4bhy28rXUPxCx++BMtW7UZOSRWyiwMr1zA0KR4Hyz0nYdj8+5keRdD23Tffo0yuVdZmH8PNr7Wc\ngV80IgmfHShHr/hYzBiZ5PF/DbRMLr/ypqmYlNoTn+4vw3Uvbfd4
fmhSPJbNH4WbAjijv2n6UFw4\nvA+ufXGba12XmE6obfAcKBcTRWjQpqa7dmoqfj9/FC576gsc1Ca1uHrKYLz2dfurubZ/lK3Q1Cww\n7M4PAbTMo1pRXe/x/OjkRI/v4ryx/ZBZeAKVNZ7bebPvvvkYdfdaAMCtM4fho+xj7b5Tg3rF4Wff\nHYp73svxWL9iwShkHz2F+y8/B+P/9LHfffWKj8VxXUwyPkvA8/dc74WlGbj+Fd+57JezhmPH4ZOu\nE7jXrj8f09JDnz6ViLKEEBkBbWv3ZO/rQz/44EJ0ktCU4yseO5P1WWYWHleyCUBGghJCYMiKDy3f\nr9kiLdmH4sklE3HZ+AEhvTaYZG/rNvvK6jqfz8lIToCavYBkfZZJCZ2l7NdMf5bUJKbiRD5xEq42\nzfCrN3b438gAtk72x3RTEEaKmCj1fqlkSe0dLzsEww3po94xyXL+UO/t/cw705I9Ec0nov1ElE9E\ny83YRwS0QLWj2uTYzFhjkhNlh6CMGSOSZIdgK6YkeyKKAvB3AAsAjAGwhIjGGL2f07WRNyoxEv8A\nscjRKz5WdgjKmDC4p+wQbMWsM/vJAPKFEAeFEPUA3gSwyOidPLEx8rpa/nbuCNkhsAimYtu5LN06\nq9FmbxWzkn0KgCO6x0XaOkMFOorPSt06KzN0gbGIJnPGLzuSdoOWiG4kokwiyiwvD22EYt+EyCtI\n1D0uxv9GjLGw6QeKMf/MSvbFAAbpHg/U1rkIIZ4TQmQIITKSkkK70dIznhMrY4wFwqxkvx1AOhEN\nIaJYAIsBvG/0Tvp0U68fNmMs8o3o1012CEEzJdkLIRoB/ALAOgB7AbwlhMjp+FXBWzo1zei3ZIwx\nvy4ZF9qIV5lMa7MXQnwohBghhBgmhHjAjH309NGNLYFvkhrmkR+eKzsEZpCrMtQb3S3LpFT7dfu0\n9QhaX1YsHC07BGVcNsF+ZzDMu3u/J6dUg4r6JBg3XsKqgXa2T/b775/v8XjhuP5YMnmQj62t8evZ\n6V7X3zFvpMWRuK27bbrHhC76AlLThrdU3fNWJVT2vAC3zhxm2nuPHdDxL9njV45vt+7tm6cGvZ+J\ng3sAAB69Qu5VUryPK96+HdQgyv3zvA7bp/19hqoa1T8xoHLLGQFcAay+5QIjQvLL9lUvAeB0bQPq\nG5tRXdcYEfVUhBA4cvwsBvWKw7qcY65681l/mINJ929AXEwUzjY0uUoJ3zJjGH70nUHo1jkasdGd\ncKjyDFJ6xKFnfCwOlldj1uOfed3Pht9Ox4Aecag4XY+HPtqLj7Ldk1Bcd0Ea+nSLxWMfH8B5g3tg\n9S0XoqauEXll1RiX0h1RnQg1dY0429CEnl1jcfTkWQzq1RXHa+oRHUXoHN0Jx2vqkdzd+AlEgtHU\nLHDJk5s9Jpm+bU46Sk7WYmWmeyjHU0sm4uIx/XDiTD0251Xg3IHdUXKyFlVnG1B88iweXbe/3Xvn\nP7AAH2Yfw/Ckbvjy2wpMHtILKT3iUFJVi/jO0RjSJ96juuFLP/kOZo7si+ZmgaF3uitIvrA0A2MG\nJKJZAE1NAtMf/QTjUrrj9Z+dj6ozDRjUqysOV57BoF5x0gdVnaipx/0f7MU73xThugvScNucdBw5\nfhbfe/oL1zYj+nXDgdJq/GLmcNw+byRqG5pQdbYB5z+4EQDw5o1TsPi5r/HYleNxxaSBeGpjnmsu\niZsuGopnPzvoN46hfeLxi1nDPcp5b/79TMTFRuHnr2Vhe+EJXHdBGpZMHoyfvrwdxSfPIqoTucbW\n7Lp3bkR0c25qFjh68qyrJPhz10zCja9m4fsTU/DujmLcdNFQLJs3Cnll1cg5WuU63u5xMag62zKB\nUDiVO4OpegkhhPSfSZMmCVUdrqwRqcvWiCkPbhBCCNHU1Cyam5tFU1Ozx+OONDY1u34e+nCvmPno\nJ67Xt2pubhYNjU2iscn93tsK
KkXqsjXiB89sMeHIrNPc3CxSl61x/ZyoqXN9hnMe/1SkLlsj9pWc\n6vD1jdrn3Pq6tp+fL637bLv9S18cFGPu/sjr+wTyfypTdvFJkbpsjSisqHata2pqFjsPnxCpy9aI\nw5U1rs9Lb/GzX4nfrNzh2r5V28+09XOe+dgnInXZGvHylgLR2NQsHl+3z/V5frD7qCg5eVakLlsj\nthdUikbd++UUV4nUZWtEQbk7vvrGJnGmrlGkLlsjRtz1oSmfSzhaj0uIls/mYHm1SF22RuQerfLY\n7qJHNokHP8x1fScD/R76AiBTBJhnlTizj2RNzQI3v5aFmy8aZvlNnaxDJ/DDf3yJacP74LUbzrd0\n30b77cqduHR8MmaN8mxq2l10En9bfwDPXZuBmCjjWyUf+nAvhiV1w1Xfkds0qIrWK6VQzmaFEPjF\nf3bgf6YMxgXDQp/wwwxvbDuMwsoarFhg7f1CR01ewnxrbhb424YDuGZKKvomRt5oY+Y8m/aVor6x\nGfPPSZYdihKCSfbcR1FhnToRfjdX3k1hxtpqe2XGrGP73jiMMcb842TPGGMOwMmeMcYcgJM9Y4w5\nACd7xhhzAE72jDHmAJzsGWPMATjZM8aYA0TECFoiKgdwKIy36AOgwqBw7MBpxwvwMTsFH3NwUoUQ\nAc3rGhHJPlxElBnokGEVOO14AT5mp+BjNg834zDGmANwsmeMMQdQJdk/JzsAiznteAE+ZqfgYzaJ\nEm32jDHGOqbKmT1jjLEO2DrZE9F8ItpPRPlEtFx2PMEgokFE9AkR5RJRDhH9Wlvfi4jWE1Ge9m9P\n3WtWaMe6n4jm6dZPIqI92nNPkjbRKRF1JqKV2vqtRJRm9XF6Q0RRRLSDiNZoj5U+ZiLqQUSriGgf\nEe0loqkOOObfaN/rbCJ6g4i6qHbMRPQiEZURUbZunSXHSERLtX3kEdHSgAIOdP7CSPsBEAXgWwBD\nAcQC2AVgjOy4gog/GcB52nICgAMAxgB4BMBybf1yAH/Rlsdox9gZwBDt2KO057YBmAKAAHwEYIG2\n/hYA/9SWFwNYKfu4tVh+C+A/ANZoj5U+ZgCvALhBW44F0EPlYwaQAqAAQJz2+C0A16l2zACmAzgP\nQLZunenHCKAXgIPavz215Z5+45X9ixDGBz0VwDrd4xUAVsiOK4zjeQ/AxQD2A0jW1iUD2O/t+ACs\n0z6DZAD7dOuXAHhWv422HI2WgRsk+TgHAtgIYBbcyV7ZYwbQHS2Jj9qsV/mYUwAc0ZJRNIA1AOaq\neMwA0uCZ7E0/Rv022nPPAljiL1Y7N+O0fqFaFWnrbEe7PJsIYCuAfkKIEu2pYwBa53Hzdbwp2nLb\n9R6vEUI0AqgC0NvwAwjO/wL4PYBm3TqVj3kIgHIAL2lNV/8iongofMxCiGIAjwE4DKAEQJUQ4mMo\nfMw6VhxjSLnPzsleCUTUDcA7AG4TQpzSPyda/mwr012KiC4FUCaEyPK1jWrHjJYzsvMA/EMIMRFA\nDVou711UO2atnXoRWv7QDQAQT0RX67dR7Zi9ibRjtHOyLwYwSPd4oLbONogoBi2J/nUhxGptdSkR\nJWvPJwMo09b7Ot5ibbnteo/XEFE0WpoUKo0/koBdCOAyIioE8CaAWUT0GtQ+5iIARUKIrdrjVWhJ\n/iof8xwABUKIciFEA4DVAC6A2sfcyopjDCn32TnZbweQTkRDiCgWLTcw3pccU8C0O+4vANgrhPir\n7qn3AbTeXV+Klrb81vWLtTv0QwCkA9imXTKeIqIp2nte2+Y1re91BYBN2tmGFEKIFUKIgUKINLT8\nf20SQlwNtY/5GIAjRDRSWzUbQC4UPma0NN9MIaKuWqyzAeyF2sfcyopjXAdgLhH11K6i5mrrOmb1\nDQ2Db44sREsvlm8B3CU7niBjn4aWS7zdAHZqPwvR0ia3EUAegA0Aeulec5d2rPuh3bHX1mcAyN
ae\nexruwXJdALwNIB8td/yHyj5uXcwz4L5Bq/QxA5gAIFP7v/4vWnpQqH7MfwKwT4v3VbT0QlHqmAG8\ngZZ7Eg1ouYK73qpjBPBTbX0+gJ8EEi+PoGWMMQewczMOY4yxAHGyZ4wxB+BkzxhjDsDJnjHGHICT\nPWOMOQAne8YYcwBO9owx5gCc7BljzAH+Hwbqm2M7x3+uAAAAAElFTkSuQmCC\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x11c802160>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Hacking up a noisy pulse train\n",
"%matplotlib inline\n",
"from matplotlib import pyplot as plt\n",
"\n",
"n = 100000\n",
"noise = np.random.normal(size=n) * 3\n",
"pulses = np.maximum(np.sin(np.arange(n) / (n / 23)) - 0.3, 0.0)\n",
"waveform = ((pulses * 300) + noise).astype(np.int16)\n",
"plt.plot(waveform)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now try filling in the body of this ufunc so that it zeroes out samples at or below a threshold:"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"@vectorize(['int16(int16, int16)'], target='cuda')\n",
"def zero_suppress(waveform_value, threshold):\n",
" ### Replace this implementation with yours\n",
" result = waveform_value\n",
" ###\n",
" return result"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<matplotlib.lines.Line2D at 0x1200f3518>]"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXsAAAD8CAYAAACW/ATfAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8FdXZB/DfQxYIIWENEAIkLGEVAUkRlCKbbFqxrVro\nq2KrVatdbKsFtGpb17q0r0tt1bpVraKI1RcVZHFBVCCRLQlLIgmQELIBgQSyn/ePTO6dm9ybu83M\nuXPm+X4++TB37tw7z1xunsycOec5JIQAY4wxtXWSHQBjjDHzcbJnjDEH4GTPGGMOwMmeMcYcgJM9\nY4w5ACd7xhhzAE72jDHmAJzsGWPMATjZM8aYA0TLDgAA+vTpI9LS0mSHwRhjtpKVlVUhhEgKZNuI\nSPZpaWnIzMyUHQZjjNkKER0KdFtuxmGMMQfgZM8YYw7AyZ4xxhyAkz1jjDkAJ3vGGHMATvaMMeYA\nnOwZY8wBONl7UXTiDD7dXyY7DMZcsoursOvISdlhKEEIgbczj6CusUl2KJbiZN/G61sPYdpfPsF1\nL22XHQpjLpc+9QUW/X2L7DCUsC6nFHes2o3/eX6r7FAsxcle52x9E+56N1t2GIz5lLb8A+QcrZId\nhq2dqm0AAGQeOoFtBcclR2MdTvY6m/PKZYeglLTlH7h+nHbJbKaf8FVnyJ7YkIffr9rtenzVs19J\njMZanOx1Hvxwr8fj4pNnJUVif8t0v1AAUFldLykSezt5ph5pyz/wWFd2uk5SNPb3tw0HZIcgDSd7\nncLKMx6PL3x4ExqbmiVFY28rM494PL7g4U2obeCz+2AdPVnrdX1rUwQL37Eq75+xajjZa97bWex1\n/b3v51gcif35Sup3rt5jcST2tzbnmNf1/93h/fvKgld2mpO9Y5yubcCv39zp9bnXtx62OBr7G3X3\nWq/rV3OCCtqTG/O8rr/nPT4JCZavq/TLnnZGLydO9gA+yvZ+9sSCV1HN7cksMj37+UGfz9XUNVoY\niRyc7AGPu/MsPKu/KerweSGERZEw5unRdft9PvfYx76fUwUn+wAUnTjjfyMGAMgrre7w+fvW7O3w\neRa4VVkd/2FlbtsLO+5P/9KWQmsCkYiTfQCyDp2QHYJtvO0nAb24pcCiSOzvHT+f5e1v77IoEvvj\nq3dO9gHxdfOWMTP9jpO5YQoqamSHIB0ne2YY7kdvHP4srXeiRu2Bf45P9tUOuAtvldXfcNdKo7zj\n50Z3q4/2lJgciXNc/YLahdH8JnsiGkREnxBRLhHlENGvtfW9iGg9EeVp//bUvWYFEeUT0X4immfm\nAYQr08+Nm1ZNzdyLxJ9N+7gstFEC7bT089e/MTcQB8k5ekp2CKYK5My+EcDvhBBjAEwBcCsRjQGw\nHMBGIUQ6gI3aY2jPLQYwFsB8AM8QUZQZwRsh0FLGw+780ORI7G/D3tKAtttbovYvlRG4Gcc4Tqps\n2RG/yV4IUSKE+EZbPg1gL4AUAIsAvKJt9gqAy7XlRQDeFELUCSEKAOQDmGx04My+Fjyxmfvb+3H/\nB9xF1ShOqmzZkaDa7IkoDcBEAFsB9BNCtDYYHgPQT1tOAaCvglWkrWv7XjcSUSYRZZaXyykt3Oyj\naeZ3F4+wOBJ1TU7r5XX9O9y+zyT6/I6ZXtdvPVhpcSTWCTjZE1E3AO8AuE0I4XEdLlpO04I6VRNC\nPCeEyBBCZCQlJQXzUsO8uf2I1/W/nJ2Or1fMtjgaNb1181SsvHFKu/VcViF4REBa767t1vs6aWG+\nDe7dFfddfk679VVn1a0mGlCyJ6IYtCT614UQq7XVpUSUrD2fDKD17lwxgEG6lw/U1kWcgor2oz3v\nXDgKABDVido9d7ae21GDceWkgQCAmOj2XzNuxQnOTy5Mw8EHF2L+OcntnlsVYM8d1uKzO2YAcH8/\n9VT+uxlIbxwC8AKAvUKIv+qeeh/AUm15KYD3
dOsXE1FnIhoCIB3ANuNCNs7zm9uP5rxx+jAAQFJC\n53bPPbXJewVC5v1M/Y+XjQUATBzUo91zj67bZ3pMdlXf2L46473fG4uWX8X2eHSob97uDaX2jgcA\ndImJwtCkeI/n1ucG1snAjgI5s78QwDUAZhHRTu1nIYCHAVxMRHkA5miPIYTIAfAWgFwAawHcKoRQ\n4pT4mU+/lR1CxPrb+vYzAMV3jgYAEBF23H2xx3Mqn0GF60y977Efo5MTLIzE/h5e63lSkd63m8fj\nt26a6vE40PENdhTtbwMhxBcAvJ9SAF4btoUQDwB4IIy4pBiX0l12CLbVtu7/nj/O9XjcPS7GynBs\nraP+3osmpKBH11gsfTEiL5YjzjtZni3Ib7a5f5TYxTnfS8ePoNXrl+jZdPPE4gmSIrG/hDa/RJ28\n3ANh3v3PvzxHct5z6RiPxxeNkNOhwY7aTt/Yu5vn73isl/tJqnLOkbbhrQfD7fNGejxeNKFdj1HG\nLPfTaUNkh2Bb3u5/OJVjk33Zac8bij+YmIJR/RMlRcNYcNqekR45znMusI45Ntm3bVUY2d/7ja+U\nHnEWRKOWF5ZmeF1/+YQBFkeirivadBtc52NicuZfcvcuHo8XPf2FpEjM5dhk39imGWfsAO83Z6+7\nIM2CaOytbZPYuIHeP8vHr/K8B8Jno+0FOivanNF927zurBnhKGXe2H5e16//7UUej3cVVVkRjuUc\nm+xvW+k5Icm09D5etzt/qPfh/sztdK1nV8G+CV28btd2oFpe2WnTYrKr0lOezYt7/zzf63azRnkm\nrl1FJ02LSRUj+3m/eu/W2W+nRCU4NtkHWgnvnDZn/FzAq73xf/44pNfVNvDNs7bazq8QFxtYwdgd\nhznZ++NtoKSTODbZB6rtoMXdil7iyfDs5wdlhxBxuP+8cfJKPa8cLxvvu3fdKB/37FTCyR7AL2cN\n9/lc2yHqfCPMOLuO8NloOBaO6y87hIh206tZHo+7d/U9gGpQL88Ccw1N6l11crIH8L3xgfcS4ZIJ\nHfv5jGEdPj92AHdvNcqKBaNlhxDRDgYxyfhtc9I9HlfXqjddqSOTfWWbol0jfNy4Yf6dbjNCcfao\nvj62bHHJue2rNjLvYqI6HnXc9myUhc5XbzyVODLZnwryr3ZcTMTOqihdsGWfuStr4OaODa6ZpvRU\nrUmROI+K3TAcmewbg2yPG9LHswwq98jRCbLkTddYZ3RzM8K5QRbm27iXJ3z35Vcd3Jfz5ov8CpMi\nkceRyf7JTflBbd+2IFr5aZ5lqRW1yfbjvdSu70gT1zr2afHkwUFtz5OU+/ar2en+N9Jv/8YOkyKR\nx5HJfn1ucD1q0tu06XN6cms7jVtMVHBfKZWngQtXsGWh+XvpW3SQ30sVOfITCHcwz9GTPDS9VUEQ\nPR684a6sbsE2L7b1+YFygyJxpljF/yCofXQB+OtV44N+zWGu6eLys39nBv2aacPdpSlWrN5jZDi2\n9tNXgv8sr8pwF0T7jJO9Syj31b5cMcuESCKH45N9KF0BeeJx7/7+4/MC2u4cnhHMq1DOzO/53lgT\nIrG/tjOnBaJPN7XLKTg+2XeODr5b5XI+G/Xq4jHeqwq2de3UVJMjsb8Xr/NeJrotpxTxCtZDH+6V\nHULEcXyyZ8bxNwio1QCeI8CvSYO52mo4avjqux1O9gHStzMz79rWEWKh66iOS0e8TbfpdDdf1HEJ\nD6fgZB+gB78/TnYIjPl1jEfRtjMsKd7/Rl6E2zsq0jgu2a/PLQ3pdZ1jHPdRMRv6YHeJ7BCkO1Tp\n2R34ohFJAb/2u7pJjAor1ep157gMpu8q2Ds+NuDXBTvAxQmOVRlzFsnlJ4wT6LSGKjt5xnOgXt9E\n7zOneXPj9KGu5WLFxtM4Ltnr+Zpk3JsuXAytnXd3FBvyPntLeHpCozRwmz06hXHvKF7Xu0m1iWQc\nnexvnRlc
cSQ9rkMC/GXtPtdyOFc+73xTZEQ4tnaqNvSyEcsXjHItf7afB1aF009gwsDgajvZiaOT\nfTh9lFW7xAvXU0smBrV9f92lNRdDA/aFcXWzRFcwjb+X4enUSd0eZY5O9uFMQLy7iKfU09Pf2ArE\nv6+f7FpubFar10MoToVREI7vJ3mqqVNvlikjODrZBzu45/M7ZrqWf7Nyl9Hh2Fqwfez1s4O99nXw\nQ9tVc4Ou48B/fna+xEjs7w//zZYdQkRydLIPVlSAI0QZC8d5g3vKDsHW8sqqXctXTwluTgCVcbIP\nQv8gunAxxuS7ftpQ/xs5BCf7IEQpfPOGRQ4edmCctN7hTcpe36jO/SRHJXsevMPsQPCcU4YJt15T\nOF1iI43fZE9ELxJRGRFl69b9kYiKiWin9rNQ99wKIsonov1ENM+swEPBPfyYHYR7TpJfxoPUwnHX\nwtGu5WaFThADObN/GcB8L+v/JoSYoP18CABENAbAYgBjtdc8Q0QRM/T0P1sPuZYfueJciZEw5lu4\n6aXEoDIWTjV7dF/X8uYDFRIjMZbfZC+E+BzA8QDfbxGAN4UQdUKIAgD5ACb7eY1l7n4vx7U8OY3r\nhYfjeE29oe/HNV3CEx/rPqc6pFgBL6vF6Oai/ThXnTmSw2mz/yUR7daaeVr7iqUAOKLbpkhbF3H4\nZmt4Tp5xJ/tbZ4ZfL3zJ81+H/R521fYmoD5xB+pe3fSEG/eGVtmVtUjRjb9Zl6POZxlqsv8HgKEA\nJgAoAfB4sG9ARDcSUSYRZZaXW1/Po2sIv1BtObk+jn5Y/o/PD22aQf2kEkeOO3eYf0GFZ0neUG4q\npvR0J6g9xafCjsmujCi9oWrJhJCSvRCiVAjRJIRoBvA83E01xQAG6TYdqK3z9h7PCSEyhBAZSUmB\n15s2Sm8DJhf+KNu5tcOvecFdETChS2g1hq7MGGhUOLamL70xZ3Rg8/i2daFuJrWK6rqwY7IrlbpK\nGi2kZE9EybqH3wfQ2lPnfQCLiagzEQ0BkA5AqTqhc3Q3b77Iq5QYSeRI7BJabZYUnosWAHDHqt2u\n5Vmj+nawJfOHu636FkjXyzcAfAVgJBEVEdH1AB4hoj1EtBvATAC/AQAhRA6AtwDkAlgL4FYhhFJt\nHY9cMd61rFIfXBYZlkwe5H8j5lOjrhnnoR/wVKJ6fq+/hRBLvKx+oYPtHwDwQDhBRbJEXZNFqFMc\nMuYLT9oenhXv7HEtzx/b35D3FEIo8f/iqBG0RoiO4o/MKJ2j+bNkxvpgj/s+Wo+uxpR+VmW+Bf5t\nY9KocLbEIpdR3y9VvqeOTPZGdLtkjDlD6Sk1RiQ7Jtnru2T14Jl9wnKYR2hGPC76Zxx9fXw7c0yy\n/2R/mWtZ1UETVqkKYwo9Zp4+urEjnx3gicfD8evZ6a7lpS+q0XvcMcl+g67nzN9+NEFiJPZnZBPm\n7XNHuJbP1PPcoeG4ZYZ7RPLGvWUdbMn8UXHAn2OS/UHdkPThSd0Me18njtjrZGC2H9w73rWsSKcH\naXp3i3Utv511pIMtmT8qDvhzTLLPOnTCtRxt4FyylTXOG5reSfeteekn3wnrvXrHuxOU0ytfzhgZ\nXtmQy8YPcC3XNjjvJMRIqvTA0XNMstfr1jm0Wi6tZuuGtK/c7rwzqBc2F7iWZ44Mb3j/uQO7u5b3\nlji3gBcAxIf5vVQxQTHjODLZhytO13Xzq2+dVx/n7awiw95Ln6Be2lJo2PvaxatfFbqWozhZMxNx\nsg/B9BHuy+2tBYHO68K8iYtx/+HcXVQlMRI59BPq/HTaEImR2J/+Bv+EQT0kRhKZHJnsw73c1beN\nsvDwJDJuo5MTZIdga/oeSM9dO0liJJHJkck+XF1ieAQuMx6B//CF45dv7HAt903oIjGSyMTJnrEI\nEcuF4ZiJ+NvFGGMO4Lhkf9P0obJDYIwxyzki2euLQl16Lt9cZYwF53hNve
wQwuaIZH9IV6WRuzKH\np65RqVkmGfNJP/n795/ZIjESYzgi2e8qOula5mQf2Xhe3/A8dqV7juTGJi6ZEI4Hv3+Oa7mmzv4n\nOY5I9q9+dUh2CCxATqqVX3XG+D9sCbo5kvUnOSx4fRPd3Tcrqu1fA8sRyb5BV04xKaFzB1syf/Rt\nl3fMGykxEvurMKGInr7u0z260bmMOSLZ7zriPsPhwRbh0V8l3TpzuCHvqS8n+/VB59Qa0k8m9eSS\niYa8Z2IX9yxsOUedU1iOZ+byzxHJ3gz3XDrGtazKHJWBeObTbw1/z2umprqWX/6y0PD3j1T6Kp9z\nRodXPdTpDpSqMXWgmTjZh2jykF6u5eo6nmEpHEunprmWi06clReIxfRXnF1jwytv3GrMgERD3sdu\nGvhmtF+c7EOkv2p8aUuB7w2ZX/qS0U7yry+M/944tbBcje6ESz8XL3PjZB8iAXe2f+3rwxIjYYzp\nZ5/7esUsiZFELk72IerfnW/0MhYp9Cdc0VHmpDW7zzfNyT5E3KuHscjx7o5i0/fR1GzvHj+c7Blj\nLAD6pls74mTPmMK4/7lx7P5RcrJnjDEfxqV0dy3bPNern+wPVda4lh/6wTiJkTBmjekjklzLdj8b\nle1H3xnkWt552N61hpRP9keOuwfpXD4hRWIk9qcvUmb0pOv6kgksPD/77hDZIShjyeTBruX9pacl\nRhI+5ZN9Y7O7u5Tdb7DIpi8/fLeuXIQRli8YZej7OVmv+FjXcj2PLA2LfpDaYV0rgR35TfZE9CIR\nlRFRtm5dLyJaT0R52r89dc+tIKJ8ItpPRPPMCjxQK7cfkR2CMvRdz3p2jelgy+ANTYo39P2cLL1v\ngmt5VVaRxEisN6JfN9Pe+xWbl0oP5Mz+ZQDz26xbDmCjECIdwEbtMYhoDIDFAMZqr3mGiKSOhf8o\n+5hr2aj6I61unTnM0PeLdPrCXUYPXOkS48ySCQBwVcZAQ98vNtr9f7Ot4Lih7x3p+B6Fb35/Y4UQ\nnwNo+41ZBOAVbfkVAJfr1r8phKgTQhQAyAcw2aBYI05yd2e1M28rNC9xdNJNIab/o6IqfZfIv/zw\nXNP280V+hWnvHYk41/sW6ulZPyFEibZ8DEDrZI0pAPTtJkXaunaI6EYiyiSizPLy8hDDkEvf9FBS\npX61xtXfmDdKsUecu1lo074y0/YTKQ4f18+LbF7xMhUmyvan6qz7XpIzy8AFJuxrcdFyihL0H1Qh\nxHNCiAwhREZSUpL/F0QgfdvoxzmlEiOxv566m4of56r/WR496Zw5EMymr3ip7z3DPIWa7EuJKBkA\ntH9bT8WKAQzSbTdQW6ek7rqz0Xvf5yngjKKv866qf35m/CQwTqVP9j8+n5O9L6Em+/cBLNWWlwJ4\nT7d+MRF1JqIhANIBbAsvxMilvxHGWDA+O2DPpstIpG/2c/KNfn/8dk8hojcAzADQh4iKANwL4GEA\nbxHR9QAOAbgKAIQQOUT0FoBcAI0AbhVCNJkUO2OM4aGP9skOwRb8JnshxBIfT832sf0DAB4IJyjG\nGGPGckw7hFOna2OMheeJxRNkh2AIxyT7OG7LY4yFQN811s4lox2T7BdNMLZwF2ORLLGLsaPFnSxK\nl+zzyqolRhIepZN9bYP73nC/RJ5G0Cg3TOOqipFuGReWM8zcsf1cy3aeh1bpZF+oq1Kn7xPPwtO7\nW2fZITA/esfz/5FRYnR1oJq5GScy/XfHUdcy10s3TrRJN7vPH9LLlPd1onm6s9ETDiiZYJWvD1bK\nDiFkSif7dTnuipdTh/WWGIn96ZvEEuPMaQ+eMbKva9nON8L8OVvv/iwH9jTnJER/U3GnA0YkW+Wx\ndQdkhxAypZO9vr5UfGe+YRWOugZ3W+WFw/uYsg9926i+NLVqDuhmPPrDJaNN39+d7+4xfR9OYefJ\nYJRO9lYY2scZk258nOtOvmY1iQ1Lck
88ceKMuk0P+lou+vlizVJSxUXXmOLJ/mC5+dOI3TzDPYFJ\no43/6vtzx6rdrmUzS/K2qm1Q97PMPHTCtWz0hDpO1plrVXWIP50w6SfeztL9ErPw3LcmV3YIptno\ngHr9Mswd2192CBGNk32Y9FX2bl+1S2IkzC4qTtfJDkFJZvUSUwUnewMdOa7+bFUsfMUn+XtilMOV\n7hm/ppnUcUAVnOwZY7ZVWeO+Srrk3GSJkUQ+TvaMMdtaqxtLwxOXdIyTPWPMtrKLqyzZz5jkREv2\nYyZO9owpakS/bv43srkt+daUL+ibaP9aQ5zsGVNU52hu1jDK7FF9/W8U4ZRN9vraKhMH95AYCWNy\n2LlCY6SZp+vDb9fxNMom+2bd93x6uvlD0p1izmjrznBULoZmhZH9EmSHoIy+uvkwKqvtOU5C2WR/\nsNw9o8wN3+XJNoxiRamEVifONFi2LxVNSuvpWi49xfVxjHK6ttH/RhFI2WRfrhulmNDF3IlLVK+V\nry/c1Ts+1tR9LZ2a6lpWsRlCXz9pxkhzrzgvHecu5bElv8LUfTnJtoLjskMIibLJ/p1vii3b14M/\nGGfZvmTQj/i8ekpqB1uGL6qT+ytp5yngfKnTHVO8yUXQ9PMONKv3d1OalZlHZIcQEoWTfZFl+xrS\nW+0yx7lHT7mWE02+Sho/qLtr+axuwhRVROnqt5jdIqZvcjtTb8+mB2YcZZO9lRqa1TsD1Xv6k3zX\ncp8Ec5tx9L0eMgvtebkcie55L0d2CKYalqT2CZcRONkboLHJfY2sYg+S/DL3zW6z66/rh7wXVJzp\nYEt7OqQr3HX5hBSJkajltjkjZIcQ8TjZG2B4X/dIxaITXNHQKP/87FvZIRjuyHF3sk9zyCxnVlhw\nDtey94eTvQGsbIdl9qZvEuOmB+NER3Eq84c/IYM9sSFPdggsgu08ctK1bOWYBRVV1/FN52BwsjfY\n21nW9QJizMl2HLZn2QJZONkzxmyppIpHBQeDkz1jzJb0N7uZf5zsGWO29NSmfP8bMZewkj0RFRLR\nHiLaSUSZ2rpeRLSeiPK0f3v6ex/GGIt0f7hktOwQwmLEmf1MIcQEIUSG9ng5gI1CiHQAG7XH0qxY\nMErm7hljiviurlT62Xr7lfIwoxlnEYBXtOVXAFxuwj4CduP0oTJ3z5hUP58xTHYIyuib4J6acP3e\nUomRhCbcZC8AbCCiLCK6UVvXTwhRoi0fA9AvzH0ErVlX4o/7Mhsnlgeu2M4VkwbKDkEZPXXlve14\nczjcQifThBDFRNQXwHoi2qd/UgghiMhrsRjtj8ONADB48OAww/D0dYE1kxA7zaNXnis7BBakYUnu\nUh61DU0etYdY6JpsWDM6rFM1IUSx9m8ZgHcBTAZQSkTJAKD9W+bjtc8JITKEEBlJScZO4qB4EUpp\nRvVPtGQ/d186xpL9yCRjysDPD5Rbvk9V6SehsYuQkz0RxRNRQusygLkAsgG8D2CpttlSAO+FG2Tw\nsVm9RyChi7nVIGU5XeueGnBEv24dbGmcn16YZsl+ZLptTrrl+1RxfgBZ1uwp8b9RhAnnzL4fgC+I\naBeAbQA+EEKsBfAwgIuJKA/AHO2xpbZKmDbs9rkjLd+nFcp00ztadf9Dvx8VS0YDwIXpfSzf5zEe\ncWqYg+U1skMIWsino0KIgwDGe1lfCWB2OEGFa/+xU/43MpiqZ/ayc+2e4iqcO7CH3CBMECeh7fzl\nLwtx00XcO8eplOxeIaNmxjQJZ2pWkH1mbcczqEDESOjZpFItmfyy067l76TxuM1AKJnsdxdVWb7P\nvgldLN+nFWQ3oqxYvUdyBMY5pbv/wcJT2+C+QdovUc3fPaMpmexls2MfXF/+8anc2aJUuqn42X7u\nDWOU4pPuGeEev6pdazLzgpO9CdblHJMdgmHe3VEsOwRl7JNwL0lV23SdMDpH89iBQHCyN8FrXx+S\nHQ
KLQH//RL05dZl9cLI3QWGlOs04jEWi+kb7DWqSjZM9Y8x2XuWr56Apneznj+0vOwTGmEKumZIq\nO4SQKZ3so6O44iVjzDiDe3WVHULI1E72nTjZM8aMM3NUX9khhEzpZD99hLHVNBmzoyu5pr1h+nRz\n17Sva7TXGBClk/2Fw9UsYcDsLbW3tU0Bs3Rno3acTi+S9OjqTvaNTbLHlwdHuWSvn1SgR9cYiZGo\n5a2bplq6v+suSHMty67PY7Q5o62dvE1frFSlEcmyNdvse6lcst9b4h6lGNNJucOTZnSytZNtDOjh\nrnfSaMNZgToSZfG9pJQe7iuJ3UUnLd23ygoq7FWkT7lsWHrKXdmPp581TrTFfzgTu7ivyux2BuWP\n1YW70nWTzmwvtH6uB1Vd9vQW2SEERblk/4v/7HAt82TjxomLtbb+iP5+i2K5HlecZ+0NU/28s/+3\ny34zLDFjKJfsZbZJvvGzKdL2bYZmic0n3buqe2Yvs3XRjnOnMmMol+xlmjqst2tZhV+q7z7yibR9\nezbjSAvDFFY3iek1KfCHU7Ub9lbhZG8SFX6p9DXDZapVoAeJ/o+/zDP7E2fsP4FK0Qn39/KuhaMl\nRmIvnOxNknuUa5cbZVVWkewQwrbziLsXjMz66ypUi9Q3610z1b61aqzGyd4ksmd4UokKyf7JTfmy\nQ1DG53kVrmUZc/naFX9SJqlXoM0+UuSXVcsOIWxb8iv8b8QCov/jb/WYBTvjZG+ST3m+UabT1eKu\nq4y1xcmeMWYrdZJv2MfatOnInlEzZjf275wVMfYdOy11/3atwmLTsBmzl9N1jbJDYAbpb3G5C6Mo\nm+x/aPGQdMaYMzzw/XGyQwiJssmeb4jZX0KXaFPe98tvK3Cq1v6Di4Lxq1nDw3p9RXUdsg4dhxAC\nG3JLPUqJ+3tdpmLF1y7QjZTfuLdUYiTBUTbZR1qXrLJTtfjP1sMhvfajPSU4UOq7nXJtdgn2B9CO\n+d7OYhRU1OClLQWoCmIkZaJJSdeflB5xhr9n1dkG/Pj5rbj51Sy/25afrsPv3tqFzXn271l1+cQU\nr+v3lpzCxznHfL7u9a2HsD63FBn3b8AP//EV/vR/ubjh35lY/NxXAe034/4NuOKfgW1rR9e/kolP\n9pXJDiMgcn6LTVBZXYfH1x+QHUY7z3yaj20Fx11dMRuamrFUm5jjUGUNXtpSCAC4ftoQDNJNZrzv\n2Cm8v/ModhdV4Qutj3bhw5dgfW4pKqrrsGTyYNe2N7/2jet5vVVZRegaG4WF45KxKqsIt7+9y/Xc\n1wcrMaBkRPgdAAAN2klEQVRHHH5ywRAcO1WLe9/Pwdwx/XCmvhF3LhyN/92Q59r2T4vGGveBBCE2\n2n0uIoTwqGJaWFGDG/6diWumpLo+z1aNTc0YftdHAICbLhqKH0wciJH9W+rxN2jjH/YdO43nPz+I\nbl2isTb7GEb2T8CKBaM89nHr699gW+FxvPNNEQofvgT/2nwQYwYkAgB+/PxW7LznYnywpwR/+G82\n8h9Y6PUEI+vQCXz1rfw+9vrvlhACd76bjQOlp5F16AQAIOdP83D/B7m4c+FoJGh1iQ5V1uCud7M9\n3uflLwsBANsLT4QVz9Ob8jB1WG9MSu3l9fn3dhbj12/uxL775uPj3FKsyzmGC4f1wY/PH+x1eyu1\nraa7Oa8CmYeO4/a5I9s9V9fYhJF/WIt/Xj0J88/pb2WY7VAkFBXKyMgQmZmZIb32ltezkJHaC9lH\nq7D6m2LX+uevzcDFY6ydEQgA0pZ/4Fq+59Ix+POa3HbbtCblBU9s9phsBQD+etV4/PatXe1e480t\nM4bhy28rXUPxCx++BMtW7UZOSRWyiwMr1zA0KR4Hyz0nYdj8+5keRdD23Tffo0yuVdZmH8PNr7Wc\ngV80IgmfHShHr/hYzBiZ5PF/DbRMLr/ypqmYlNoTn+4vw3Uvbfd4
fmhSPJbNH4WbAjijv2n6UFw4\nvA+ufXGba12XmE6obfAcKBcTRWjQpqa7dmoqfj9/FC576gsc1Ca1uHrKYLz2dfurubZ/lK3Q1Cww\n7M4PAbTMo1pRXe/x/OjkRI/v4ryx/ZBZeAKVNZ7bebPvvvkYdfdaAMCtM4fho+xj7b5Tg3rF4Wff\nHYp73svxWL9iwShkHz2F+y8/B+P/9LHfffWKj8VxXUwyPkvA8/dc74WlGbj+Fd+57JezhmPH4ZOu\nE7jXrj8f09JDnz6ViLKEEBkBbWv3ZO/rQz/44EJ0ktCU4yseO5P1WWYWHleyCUBGghJCYMiKDy3f\nr9kiLdmH4sklE3HZ+AEhvTaYZG/rNvvK6jqfz8lIToCavYBkfZZJCZ2l7NdMf5bUJKbiRD5xEq42\nzfCrN3b438gAtk72x3RTEEaKmCj1fqlkSe0dLzsEww3po94xyXL+UO/t/cw705I9Ec0nov1ElE9E\ny83YRwS0QLWj2uTYzFhjkhNlh6CMGSOSZIdgK6YkeyKKAvB3AAsAjAGwhIjGGL2f07WRNyoxEv8A\nscjRKz5WdgjKmDC4p+wQbMWsM/vJAPKFEAeFEPUA3gSwyOidPLEx8rpa/nbuCNkhsAimYtu5LN06\nq9FmbxWzkn0KgCO6x0XaOkMFOorPSt06KzN0gbGIJnPGLzuSdoOWiG4kokwiyiwvD22EYt+EyCtI\n1D0uxv9GjLGw6QeKMf/MSvbFAAbpHg/U1rkIIZ4TQmQIITKSkkK70dIznhMrY4wFwqxkvx1AOhEN\nIaJYAIsBvG/0Tvp0U68fNmMs8o3o1012CEEzJdkLIRoB/ALAOgB7AbwlhMjp+FXBWzo1zei3ZIwx\nvy4ZF9qIV5lMa7MXQnwohBghhBgmhHjAjH309NGNLYFvkhrmkR+eKzsEZpCrMtQb3S3LpFT7dfu0\n9QhaX1YsHC07BGVcNsF+ZzDMu3u/J6dUg4r6JBg3XsKqgXa2T/b775/v8XjhuP5YMnmQj62t8evZ\n6V7X3zFvpMWRuK27bbrHhC76AlLThrdU3fNWJVT2vAC3zhxm2nuPHdDxL9njV45vt+7tm6cGvZ+J\ng3sAAB69Qu5VUryPK96+HdQgyv3zvA7bp/19hqoa1T8xoHLLGQFcAay+5QIjQvLL9lUvAeB0bQPq\nG5tRXdcYEfVUhBA4cvwsBvWKw7qcY65681l/mINJ929AXEwUzjY0uUoJ3zJjGH70nUHo1jkasdGd\ncKjyDFJ6xKFnfCwOlldj1uOfed3Pht9Ox4Aecag4XY+HPtqLj7Ldk1Bcd0Ea+nSLxWMfH8B5g3tg\n9S0XoqauEXll1RiX0h1RnQg1dY0429CEnl1jcfTkWQzq1RXHa+oRHUXoHN0Jx2vqkdzd+AlEgtHU\nLHDJk5s9Jpm+bU46Sk7WYmWmeyjHU0sm4uIx/XDiTD0251Xg3IHdUXKyFlVnG1B88iweXbe/3Xvn\nP7AAH2Yfw/Ckbvjy2wpMHtILKT3iUFJVi/jO0RjSJ96juuFLP/kOZo7si+ZmgaF3uitIvrA0A2MG\nJKJZAE1NAtMf/QTjUrrj9Z+dj6ozDRjUqysOV57BoF5x0gdVnaipx/0f7MU73xThugvScNucdBw5\nfhbfe/oL1zYj+nXDgdJq/GLmcNw+byRqG5pQdbYB5z+4EQDw5o1TsPi5r/HYleNxxaSBeGpjnmsu\niZsuGopnPzvoN46hfeLxi1nDPcp5b/79TMTFRuHnr2Vhe+EJXHdBGpZMHoyfvrwdxSfPIqoTucbW\n7Lp3bkR0c25qFjh68qyrJPhz10zCja9m4fsTU/DujmLcdNFQLJs3Cnll1cg5WuU63u5xMag62zKB\nUDiVO4OpegkhhPSfSZMmCVUdrqwRqcvWiCkPbhBCCNHU1Cyam5tFU1Ozx+OONDY1u34e+nCvmPno\nJ67Xt2pubhYNjU2iscn93tsK
KkXqsjXiB89sMeHIrNPc3CxSl61x/ZyoqXN9hnMe/1SkLlsj9pWc\n6vD1jdrn3Pq6tp+fL637bLv9S18cFGPu/sjr+wTyfypTdvFJkbpsjSisqHata2pqFjsPnxCpy9aI\nw5U1rs9Lb/GzX4nfrNzh2r5V28+09XOe+dgnInXZGvHylgLR2NQsHl+3z/V5frD7qCg5eVakLlsj\nthdUikbd++UUV4nUZWtEQbk7vvrGJnGmrlGkLlsjRtz1oSmfSzhaj0uIls/mYHm1SF22RuQerfLY\n7qJHNokHP8x1fScD/R76AiBTBJhnlTizj2RNzQI3v5aFmy8aZvlNnaxDJ/DDf3yJacP74LUbzrd0\n30b77cqduHR8MmaN8mxq2l10En9bfwDPXZuBmCjjWyUf+nAvhiV1w1Xfkds0qIrWK6VQzmaFEPjF\nf3bgf6YMxgXDQp/wwwxvbDuMwsoarFhg7f1CR01ewnxrbhb424YDuGZKKvomRt5oY+Y8m/aVor6x\nGfPPSZYdihKCSfbcR1FhnToRfjdX3k1hxtpqe2XGrGP73jiMMcb842TPGGMOwMmeMcYcgJM9Y4w5\nACd7xhhzAE72jDHmAJzsGWPMATjZM8aYA0TECFoiKgdwKIy36AOgwqBw7MBpxwvwMTsFH3NwUoUQ\nAc3rGhHJPlxElBnokGEVOO14AT5mp+BjNg834zDGmANwsmeMMQdQJdk/JzsAiznteAE+ZqfgYzaJ\nEm32jDHGOqbKmT1jjLEO2DrZE9F8ItpPRPlEtFx2PMEgokFE9AkR5RJRDhH9Wlvfi4jWE1Ge9m9P\n3WtWaMe6n4jm6dZPIqI92nNPkjbRKRF1JqKV2vqtRJRm9XF6Q0RRRLSDiNZoj5U+ZiLqQUSriGgf\nEe0loqkOOObfaN/rbCJ6g4i6qHbMRPQiEZURUbZunSXHSERLtX3kEdHSgAIOdP7CSPsBEAXgWwBD\nAcQC2AVgjOy4gog/GcB52nICgAMAxgB4BMBybf1yAH/Rlsdox9gZwBDt2KO057YBmAKAAHwEYIG2\n/hYA/9SWFwNYKfu4tVh+C+A/ANZoj5U+ZgCvALhBW44F0EPlYwaQAqAAQJz2+C0A16l2zACmAzgP\nQLZunenHCKAXgIPavz215Z5+45X9ixDGBz0VwDrd4xUAVsiOK4zjeQ/AxQD2A0jW1iUD2O/t+ACs\n0z6DZAD7dOuXAHhWv422HI2WgRsk+TgHAtgIYBbcyV7ZYwbQHS2Jj9qsV/mYUwAc0ZJRNIA1AOaq\neMwA0uCZ7E0/Rv022nPPAljiL1Y7N+O0fqFaFWnrbEe7PJsIYCuAfkKIEu2pYwBa53Hzdbwp2nLb\n9R6vEUI0AqgC0NvwAwjO/wL4PYBm3TqVj3kIgHIAL2lNV/8iongofMxCiGIAjwE4DKAEQJUQ4mMo\nfMw6VhxjSLnPzsleCUTUDcA7AG4TQpzSPyda/mwr012KiC4FUCaEyPK1jWrHjJYzsvMA/EMIMRFA\nDVou711UO2atnXoRWv7QDQAQT0RX67dR7Zi9ibRjtHOyLwYwSPd4oLbONogoBi2J/nUhxGptdSkR\nJWvPJwMo09b7Ot5ibbnteo/XEFE0WpoUKo0/koBdCOAyIioE8CaAWUT0GtQ+5iIARUKIrdrjVWhJ\n/iof8xwABUKIciFEA4DVAC6A2sfcyopjDCn32TnZbweQTkRDiCgWLTcw3pccU8C0O+4vANgrhPir\n7qn3AbTeXV+Klrb81vWLtTv0QwCkA9imXTKeIqIp2nte2+Y1re91BYBN2tmGFEKIFUKIgUKINLT8\nf20SQlwNtY/5GIAjRDRSWzUbQC4UPma0NN9MIaKuWqyzAeyF2sfcyopjXAdgLhH11K6i5mrrOmb1\nDQ2Db44sREsvlm8B3CU7niBjn4aWS7zdAHZqPwvR0ia3EUAegA0Aeulec5d2rPuh3bHX1mcAyN
ae\nexruwXJdALwNIB8td/yHyj5uXcwz4L5Bq/QxA5gAIFP7v/4vWnpQqH7MfwKwT4v3VbT0QlHqmAG8\ngZZ7Eg1ouYK73qpjBPBTbX0+gJ8EEi+PoGWMMQewczMOY4yxAHGyZ4wxB+BkzxhjDsDJnjHGHICT\nPWOMOQAne8YYcwBO9owx5gCc7BljzAH+Hwbqm2M7x3+uAAAAAElFTkSuQmCC\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x11aec8320>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# the noise on the baseline should disappear when zero_suppress is implemented\n",
"plt.plot(zero_suppress(waveform, 15))  # pass an integer threshold to match the int16 signature"
]
},
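{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible implementation of the exercise (a sketch, not the only answer): keep a sample when it exceeds the threshold, and return zero otherwise. It is defined under a separate name here so you can compare it against your own version."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Possible solution sketch: suppress samples at or below the threshold\n",
"@vectorize(['int16(int16, int16)'], target='cuda')\n",
"def zero_suppress_solution(waveform_value, threshold):\n",
"    return waveform_value if waveform_value > threshold else 0\n",
"\n",
"plt.plot(zero_suppress_solution(waveform, 15))"
]
},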
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: 3 - Memory Management.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GTC 2017 Numba Tutorial Notebook 3: Memory Management\n",
"\n",
"## Managing GPU Memory\n",
"\n",
"During the benchmarking in the previous notebook, we used NumPy arrays on the CPU as inputs and outputs. If you want to reduce the impact of host-to-device/device-to-host bandwidth, it is best to copy data to the GPU explicitly and leave it there to amortize the cost over multiple function calls. In addition, allocating device memory can be relatively slow, so allocating GPU arrays once and refilling them with data from the host can also be a performance improvement.\n",
"\n",
"Let's create our example addition ufunc again:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from numba import vectorize\n",
"import numpy as np\n",
"\n",
"@vectorize(['float32(float32, float32)'], target='cuda')\n",
"def add_ufunc(x, y):\n",
" return x + y"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"n = 100000\n",
"x = np.arange(n).astype(np.float32)\n",
"y = 2 * x"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 111.70 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"1000 loops, best of 3: 1.34 ms per loop\n"
]
}
],
"source": [
"%timeit add_ufunc(x, y) # Baseline performance with host arrays"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `numba.cuda` module includes a function that will copy host data to the GPU and return a CUDA device array:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<numba.cuda.cudadrv.devicearray.DeviceNDArray object at 0x112327f28>\n",
"(100000,)\n",
"float32\n"
]
}
],
"source": [
"from numba import cuda\n",
"\n",
"x_device = cuda.to_device(x)\n",
"y_device = cuda.to_device(y)\n",
"\n",
"print(x_device)\n",
"print(x_device.shape)\n",
"print(x_device.dtype)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Device arrays can be passed to CUDA functions just like NumPy arrays, but without the copy overhead:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 429 µs per loop\n"
]
}
],
"source": [
"%timeit add_ufunc(x_device, y_device)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's a big performance improvement already, but we are still allocating a device array for the output of the ufunc and copying it back to the host. We can create the output buffer with the `numba.cuda.device_array()` function:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"out_device = cuda.device_array(shape=(n,), dtype=np.float32) # does not initialize the contents, like np.empty()"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"And then we can use a special `out` keyword argument to the ufunc to specify the output buffer:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1000 loops, best of 3: 235 µs per loop\n"
]
}
],
"source": [
"%timeit add_ufunc(x_device, y_device, out=out_device)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now that we have removed the device allocation and copy steps, the computation runs *much* faster than before. When we want to bring the device array back to the host memory, we can use the `copy_to_host()` method:"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0. 3. 6. 9. 12. 15. 18. 21. 24. 27.]\n"
]
}
],
"source": [
"out_host = out_device.copy_to_host()\n",
"print(out_host[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercise\n",
"\n",
"Given these ufuncs:"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"@vectorize(['float32(float32, float32, float32)'], target='cuda')\n",
"def make_pulses(i, period, amplitude):\n",
" return max(math.sin(i / period) - 0.3, 0.0) * amplitude\n",
"\n",
"n = 100000\n",
"noise = (np.random.normal(size=n) * 3).astype(np.float32)\n",
"t = np.arange(n, dtype=np.float32)\n",
"period = n / 23"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Convert this code to use device allocations so that host<->device copies happen only at the beginning and end, then benchmark the performance change:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"pulses = make_pulses(t, period, 100.0)\n",
"waveform = add_ufunc(pulses, noise)"
]
},
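{
"cell_type": "markdown",
"metadata": {},
"source": [
"A possible solution sketch: copy the inputs to the device once, keep the intermediate `pulses` array on the GPU, and copy only the final waveform back to the host. (Exact timings will vary with your GPU.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Device-resident version: only t/noise are copied up, only waveform comes down\n",
"t_device = cuda.to_device(t)\n",
"noise_device = cuda.to_device(noise)\n",
"pulses_device = cuda.device_array(shape=(n,), dtype=np.float32)\n",
"waveform_device = cuda.device_array(shape=(n,), dtype=np.float32)\n",
"\n",
"make_pulses(t_device, period, 100.0, out=pulses_device)\n",
"add_ufunc(pulses_device, noise_device, out=waveform_device)\n",
"\n",
"waveform = waveform_device.copy_to_host()"
]
},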
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[<matplotlib.lines.Line2D at 0x112119160>]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAAAXQAAAD8CAYAAABn919SAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJztnXd4VGXaxu8nCaGX0EMIhBJBeolIE2mKiApibx+uBdeG\nq6su1tV1VfRTVz87omtFRVBBEKQICghI6CVAQmihJIEEEkIqeb8/5iRkZs7MnJk5Zc57nt91cTHz\nzinPmZy5z1ueQkIIMAzDMPYnymoDGIZhGH1gQWcYhpEEFnSGYRhJYEFnGIaRBBZ0hmEYSWBBZxiG\nkQQWdIZhGElgQWcYhpEEFnSGYRhJiDHzZM2bNxdJSUlmnpJhGMb2bNiw4bgQokWg7UwV9KSkJKSm\nppp5SoZhGNtDRAe0bMdTLgzDMJLAgs4wDCMJLOgMwzCSwILOMAwjCSzoDMMwksCCzjAMIwks6AzD\nMJLAgs4wEcTSndnILiix2gzGprCgM0wEcdfnqbj2gz+sNoOxKSzoDBMhVBVsP5RXbLEljF1hQWeY\nCEHRc4YJGRZ0hmEYSdAk6ET0MBHtIKLtRPQ1EdUhoqZEtISI0pX/44w2lmFkhjvoTLgEFHQiSgAw\nBUCKEKIHgGgANwKYCmCZECIZwDLlvW0oKCm32gSGcUPwnAsTJlqnXGIA1CWiGAD1ABwBMB7AZ8rn\nnwGYoL95xpB2tAC9nluMORuyrDZFCjYfOonKShYjhrGagIIuhDgM4DUABwEcBXBKCLEYQCshxFFl\ns2MAWhlmpc7sPlYIAPhtT67Fltif1RnHMeHd1fhk9T6rTbE9/EhkwkXLlEscXL3xDgDaAKhPRLfW\n3Ea4xoqq9yMRTSaiVCJKzc2NDAElcv3PP6DwuWXGOgDAnuxCiy2xP1UzLlX3J8MEi5Ypl9EA9gkh\ncoUQ5QC+BzAYQDYRxQOA8n+O2s5CiOlCiBQhREqLFgErKDE2RQjgizX7cbq0wmpTbIvgLoauZBeU\nOG5aVYugHwQwkIjqEREBGAUgDcA8AJOUbSYBmGuMicbx05YjOMtzv7qwOuM4npm7A8/P22G1KbaH\nO+j6cNvH6/D377bg1BnnOEBomUNfB2A2gI0Atin7TAcwDcAlRJQOVy9+moF2Gsb8rUesNsG2fLLq\n3Lz5kVOu/CP5Dvrx6MnmQydRVlEJAKgUrvdM6BSWlGNP9mkAwO7sQiRNXYANB/Ittsp4NBWJFkL8\nE8A/PZpL4eqt24bKSoGOT/6Mlg1rV7eVlJ+10CL7cvRUMf41f6dXe2kFf5/Bkpl7GhPeXY3r+ret\nbpvw7mrsnzbOQqvshxACV76zCn+9uBO+Wnuwuv36D9cAcHXe+reXO1zGUZGilcqqU05haXXbP+Zs\ns8ocWzPmP7+rtq9MP84PySCpGtV857D5Xr2pFMD2wwV48OtNWJN5wutzJ7j5O0rQffH9Rv4hBUtB\nie/FTxZ0fXj42838XQZB1dqDL+Hem3vaNFusggUdwCOztlhtglSUlFdabYKt8OWm+MOmw3jqh+3m\nGmNjAnXAV6YfN8UOK3GMoJdVVGJWqu+e+InTpT4/Y4Jj4MvLsPGg/AtQZjCHR49MEDhG0P85bwee\n/MH3fPmbS9NNtMbebNIg1hsd4FGgFxk58k8FMObgGEH/+s+Dfj//Yu0BnDxTZpI19mbTwcAudaUV\nPO2ilcdnb/X7eV4R35daOFUc2GV2wdajAbexM44RdC08/O1mq02Qhv/9ZbfVJtgCLVNTgQSfcZFX\nFHja9P6ZG6V2rWVBr8Hy3ZGRa4ZxDhPfC1w/dGlaNjId4KERLlrdEo+dkrcINws6o5kjJ4txzft/\naJ6aYpc7/dhxpMBqE6ThizUHrDbBMFjQGc1M/z0TGw7k4/9+zdC0/eVvrTTYInsTTEGLfceLDLRE\nDvafOKNpuxmr5E31zILOGEYmi5BfngzCx/yN
JXsMtEQO7v481WoTLIcF3YPFO45ZbQLjEAJ5XjFM\nsLCge8CVd5hIhTMwMoFwhKCvCiLk1wkJfBh74oT0r6HCBbZdSC/ouYWluPXjdZq31xKcwDBWwKLl\nmxVB1geW9buUXtALSoIT6F3HCrFOJfUmA3y+Zr/VJjCMKmlHg3PrXL5btWKm7ZFe0EN5Et8wfa0B\nltifUKr1cZIu/Xh/xV5pe5bh8uqi4CKT7/hUTo8Y6QV9by67zulBqHUZ12Xm6WyJczlRVIatWaes\nNoOJYKQW9FNnynHPFxusNkMKMkIMPX9l0S6dLXE2FVzUXDdkjGSWWtCzC+XN2cDYm0N52qIaPSmV\nUITCJT27MKT9uj6zSGdLrEdqQWeYSOXjEMPPT7IXlhePfscVx6qQWtAreQFJN3yVSWNC49M/9ltt\ngjQUlfGopQqpBb28ggU9EvgjQ/5ajmYxK/WQ1SZEHFzx6RxSCzr3KiODm2doD+xi/LOCc/YzfpBa\n0KPCUPSs/NAWrWSFn4368eOmw1abIA1zN4f3XVZK5jUktaD/EkbmRK6JyRjFmr0ciawHOQUleOib\n8MpGllfK9TuXWtDfWpYe8r7Ld8kZGsxYz7c8D64LZWflEmM9kFrQw+HfC9KsNiGikGxkamtC9WFn\nvJHNEY4FndHE+yv2hrV/Dgd56QZX5tGP//n4T6tN0BUWdD9w9aJzLE3LDmv/5+bt0MkS5nRphdUm\nRAR69K7/3C9XriEWdD88+cM2q02ICPQQEMnWniwlv6jMahOYCIUFnQnIV2sPhH2MRTuO4UwZ9yz1\ngCMjGV9IK+ilFf5v+imjkgMegxcCXRTrlBDqyMliXY7DMIC2wMGnx51vvCERhCZBJ6ImRDSbiHYR\nURoRDSKipkS0hIjSlf/jjDY2GALN2V7dNyHgMfJ4aAsA+H4jB8LohYwpW63ioAZvn7su6miCJZGD\n1h76WwAWCSG6AugNIA3AVADLhBDJAJYp7yOGr//07+vboXl9bH72EpOssTdafjiMNgJ1NJ4Y29Vx\nvcpQOX5aW4frjet7G2xJ5BBQ0ImoMYBhAD4GACFEmRDiJIDxAD5TNvsMwASjjDSKJvVi0a5pPavN\nkII59w622gRbcOCE/4dj33ZxuOuijnj12l4mWWRfApXju31wEgBgYr+2JlgTGWjpoXcAkAvgv0S0\niYhmEFF9AK2EEEeVbY4BaKW2MxFNJqJUIkrNzY2MxEIrHx9R/fq3x4ZbZ4hE9G8fUTNuEcuaAAXI\nU5Tv8fqURMy8+0Kf2y3YetTnZ4yL567qbrUJpqNF0GMA9APwvhCiL4AieEyvCNejUvVxKYSYLoRI\nEUKktGjRIlx7dYc4JaNu9ExobLUJtubruwciKurc/TioYzOf294/c6MZJkU0q9I5LbMnWgQ9C0CW\nEKIqB+psuAQ+m4jiAUD53zbJT9rG1bXaBKkY2LEpAOArPz1KJjCdWtZ3e8+dDf98tyHLahMijoCC\nLoQ4BuAQEXVRmkYB2AlgHoBJStskAHMNsTAE/Lkszn9wKP9QdKJvuyYAgKfHdQMANKpTy0pzbE/L\nhnWsNsGRyOR5FKNxuwcBfEVEsQAyAfwFrofBLCK6E8ABANcbY2LwPDHHd4RnD5VpgfQXx6K0ohI9\n/vmLkWZJx7s398PMdQfRvU0jTduzX79v5tw7yGoTpOHqvgm4rEdrzdt/snof7hve2UCLzEOT26IQ\nYrMyD95LCDFBCJEvhDghhBglhEgWQowWQkRMUoTvgywgUCs6Cg1qa322MVW0aVIXj47p4jbimXXP\nIHx990DV7WeuO2iWabaja2ttD0UmMC9P7Ikx3bUL+quLdhtojblIGymqFxxcpM6Ukeo9mgEdmmJQ\nJ/XFPP4ufVPfR4diRJfIcySIdOrUivZqi3LILKujBH1AUtOg9ykoLjfAEvszXkOkrSdnHJ6DxFel\nolaNavvc
5+YL2xtljpQ8flkX1faxPeNNtsQaHCXo//3LBUHvM2/LEQMssT+1Y4K/dcodXmHm+43q\nXhnf3eM7KCsm2iFdS53wNRf+n+v7IKmZ/EGEjhJ0X8Naf7yxZI8BltiftnHB/zi2ZJ00wBL7sPFg\nvmp7Oz9Cc3FyC3RoXt/n54w2YmOi8MFt/XFRcnOrTTEURwl6IPwNfZ3Kf1fv0+1YJ884e/pqb25R\n0PtERRHeu6WfAdbYm1DK8HVt3Qhf3Cl3rAQLeg3G9wl+XlhmFm0/hud/2unV/qXkPwozSWgSepBb\nZu5pHS2xFy8v1Lfmryy5+lnQa9CqkXpgx4YDEeORaSoH89R7lEMlH7aayeRhgdO7tmmsLvpvLk3X\n2xzbUFCsrwD/uEmOtTIWdA3M2yzHHztYojii1nBSkgInNWtcTz0Cd9F259a81fvWDLdmbqQgnaCf\n8jFPu+Hp0QH3bVpf/YdT4dAQRyNSJDh1tGMEZQ71Gko7WoCVYSTmGqfiwvjrLtukovKLdIK+8ZC6\nJ0GzBoEXPBvWVhf0sgpn/nCMKBn34MxNuh/TDvj6LgnaHprj+7TR0xxb8/ma8GrcCvXEsFIgnaD/\nFIbfeNum6nOVTvWf3hqGm+G7N6t7Zhw5VRLyMe3M6VL1Od84H6NCT168umdQx2V8M/p81dINUiCd\noKvVv7x/RCdN+3I+DXdOl4Ye2TmulzMi87Tiqx8e72PB0xNfuYZumbFOtV1mcgrC6xSktA8+Ytwu\nSCfoajw2pmtY+8s7QPNP2tECr7ZEH6MYNRrW4YRnVRi1vrzlkPOCtXaq3JeMC0cIerhUnHWqpHuj\ndc4XAJ670nklwHzj/b1N7MdxD6EQ7rNRZuctFnQNLNjG9RurCObHcE1/5xTnDYRaTzq5ZUMLLLE/\n4Xpf+XIDlQEWdA/m3j/EahMiGok7N4byz3k7vNp4Sio0isKM6pS5shbfUR70TmxitQkRzZRRyVab\nYEvUvFFuvCDRAkvsT2GJ93f5wIjOiA0hA2hNKs5WIiba3n1ce1vvwdpM9XzTelDpsOCio6fU/aYn\n9uNpFL2wu3hYxVmV3+KjY7qE3dk4/9lFYe0fCUh1R904fa0uxxnS2bvizu7sQl2ObRfu+2qj1SYw\njKmUS+D8IJWg68XgTt7Jp2ReGVej2OHVhRi5kXX9ggVdhcSm8lc2CYQReVyqKK1w1sOiQqdIY1lF\nyApukbS0Hwu6CtEqYlbpsOh/taAivXhxgb65rCOdZ1U8XELhIR9zxE4qvv3nPn2Su00ZpV6qzu6w\noKuglrzHST8aX+hVgT7c5Ep2Y/EOfVKz3jm0g2r76ozQMw/ajes/XKPLcerFyjnaYUHXyLRFzupV\nqvHyxF5Wm2BL9JpiMnIajHFhd2826QX9lWvUs9T5Q6j8Tbcf5vwRrRurV3Ri/KPmN60nrPP6MS+M\nbK2RgPSCfsMF7YLeJ7lVA9V2Nf9Xxj+//v1iq02QnmDy68hIvI4djRW77V3oQmpBbxPiH9pXGt1w\nQ46dSMcW6g9HRj+c3kP/4s4Buh3rR5uXm5Ra0L+9Z5Cux1ObimEYxlpaNOSpwCqkFvRw/Mk5z4Z+\n/GVIklfbobwz5hsSQYTaq74+hVMv6EXdWtFWm6A7Ugt6OMREO3wcqyPntfJOE3vTR/qkabAry/8+\nPKT9urdp7NXm9Du1fmxowjygg3yVi1jQg4B90UOjV1tvEcrK178AtZ1Ial4/pP0a1ZXTfzocQk1y\n9vr1vXW2xHqkEfSlO/UJ3qgiNtr7qT91zlZdz2En2sZpLz3ntW8TTqVQk4uSvXMFaWV8b+8qR/dy\nIrWQaN6gttUm6I5mQSeiaCLaRETzlfdNiWgJEaUr/8cZZ2Zg7vo8VdfjjenuXRm8pMIZ8f9qwRW+\nws61EM3TV278bfR5Ie8bFcXfJeObYHroDwGoGS45FcAyIUQygGXKe2mIVv
vhOMTNpVLlOq9LCX2R\n2FfFeqfSvY26W2w4CIfcm4x/NAk6EbUFMA7AjBrN4wF8prz+DMAEfU2zltox8q2AM5FBnTC9K2JU\nOhuHTzp7TYJxobWH/iaAxwHUnHNoJYSoqp58DID3HIVJ6JWetCY9Erx7UVuyTul+nkiE+3r6kW/A\nQnqoC6oy8uq14eUXurxna50siQwCCjoRXQEgRwixwdc2wjXeU9UBIppMRKlElJqbmxu6pX5Qi8jv\nFh/esNbJiZA2Hsi32gRp+Hn70cAbBcmTl3fV/Zh2oFyl43Z9GFOBAPDE2PPD2j/S0NJDHwLgKiLa\nD+AbACOJ6EsA2UQUDwDK/6pJEIQQ04UQKUKIlBYt9Em/qoUoafx3zOcGnUr5BeKMA1IpTFu4S/dj\n1lGZDnTCFHpOYanVJkQ8AWVPCPGEEKKtECIJwI0AfhVC3ApgHoBJymaTAMw1zMoAqOUv12MhLhzP\nDpm4y0ce7mDomeDti55/pjzs40Y6RmRadGpFrY9+z7TahIgnnH7sNACXEFE6gNHK+4jhsTHhD0tH\ndG2pgyX2Z9LgpLCPMfPuC73a2DMjNNQE/eWF8ufr//SP/VabEPEE1Y0VQqwAsEJ5fQLAKP1N0oeW\nDY0JGhBCOG5+PTYm/PmrhnVqebWxnuvHz9uOWW2CNGzLOoWeKtHNdkCKmWY1YTBKc9NzThtz4AhB\nLed7rRBDqwPBgs5YjVq06JXvrLLt6FEKQVcjVgcRUvujFpfJXbF+33HzHlhqax8MoxU9ArTq+kjs\nVWHTYjbSCroedFO5YUrK5RZ0MzsmNv3NMBZw4rS3h8u1/Y1LJWzTDrocgl6m4p+qR6SnE6NF1e7j\nxnW957/1YMRrKww5LiMfe3OLvNoSmoSeMC4Qdh09SiHoH6zY69XWuJ4xIuREVPPaMCGx9JFhVptg\nS9RuQb2cE24a4B2cxD10C3lPRdCN4lSx3L7Td+uctZJxp3NL72IfTGDUtFuvfsZlPeL1OVAEIIWg\nm4kRkX+RxIETzi4NZxcuSLI0W7XpqPXGe7VtosuxLz7PO4J9b649vdmkFPRGdYxL15p53Hsuj9FG\nQ4el0d186KTVJkiDWme8hUGxJgCQns2CHjHUi9VPOJzWEzKSlo3kqxDjjwnvrjbs2PcO7+TVJrNL\nrdnBfLwoahFHT3nngdbzb682HGNC44pebaw2QRpGdvXOVv3Y7C0WWGIOvCyvDdsLeq5KBrYreum3\nyHHn0I5ebUWl8mcJNIKbBrSz2gSpmb9V/1S9kcIRkwt4sJdLBBFOzUZP1CLJ3ly6R7fjOwmHpcBh\ndMTsQtgs6Bah9sXXN3jxrcwhxaL1JooVnbEJajmN7IDtBf2rdQesNoHRiJpXwq5jBRZYwtidJ8Ya\nW7XptcW7DT2+Udhe0HcfK7TaBGlQG3lseHq0oedcv5/L3TH+2aBSErGNgWH/gH2rI9le0PdY4C8q\naz50tZw4zVTSi4aDZ4xAicSudp7ccqG+i8LxjevoerxI5eNVXKlIK7YX9GILsh+quUrKgBmPqe5t\n3AsHvPiz/JV2qhjRRd8KWFf1ZjdQvejaWo6UDLYXdCv4ZUe21SYYghkDj9q1nHvL6R1YZdeFu0jk\n5ykXWW2CLjj318V4QSb00Z0cXKRX7pEqztrVty5ICoqNj/uIkiSjqHSC3oTT5obMYROCN+rWckaO\n+e2HTxl+DrU6rTKyJvOE1SbYBukE/dYL21ttgm0Z/cZvhp+jaf1Yw88RCVzx9irDz/HAiM6GnyMS\n4Kkl7Ugn6DHRcgydZKVPor7TDk4mNka6n69mJHU0Cxvp7ggj5oG/uutC3Y/pVHwV5WWYYBjTvbXV\nJkQk8gm6AU/uIZ2b639QxlFcxgKkK7WijZeuAyfsV/tAPkG32gCJ6N22ceCNGE20b17PlPM4Ic9Q\nzwRz7st9NixmI52gy+J+FAmM7SlPrU
WrGX2+d/5yI1iZnmvKeazErLWD8rP2W4yVTtCNLEvlNO4Z\n5p0LngkNs7oZDnFNN4XvN2ZZbULQSCfo1/Vva7UJ0mBWzhonZFzsYdI0wad/7DflPFYyZVSyKedZ\nuP2YKefRE+kEXdbEWTLz0e/7rDbBcOqYFFC1KuO4KeexEi4L6RtbC/onq+QXAicwx4ZDW3+UWJAw\njgmfxnXtH3lra0H/1/ydVpsgDflFZVabIA0rdsu/MGkWeSbel78/PsK0cxmFrQXdTGp7rKyrJd23\nM31fWGK1CdKwZq/80x5mkZlrXr0D7qE7CE9B/3NfnkWWmAMvRYTOZ2u4LCJjDQEFnYgSiWg5Ee0k\noh1E9JDS3pSIlhBRuvJ/nPHm+qdvO+PyhHgutu44Ynw2PStJjDMuEOaafuyJxDBGoKWHXgHg70KI\nbgAGArifiLoBmApgmRAiGcAy5b2lGBm84RmvNH/rUcPOFQn0NDBKVO9CD06mjUoZOl6UdS4BBV0I\ncVQIsVF5XQggDUACgPEAPlM2+wzABKOM1Mp9wzsZduzbBjorLe9bN/Qx7Nhm5OGIJH64b7Bhx16g\nUmlHrTasXSn1SGXw7eSBFlliD4L6ZRFREoC+ANYBaCWEqOqmHgOg2j0moslElEpEqbm5xq7+G+mD\nfqnDkivFGCi6V/V2VkoBI4OK4iTPL//g15vc3jfwKDLOuKP5V0tEDQDMAfA3IYRbaJ8QQgBQDToW\nQkwXQqQIIVJatLBvQIBZkX5OoHNLOQryasVpIxI98XRb7NyygannP1NmfPk7PdF0pxFRLbjE/Csh\nxPdKczYRxSufxwPIMcZEhmGCQeaMi7VjzM2n/58le0w9X7ho8XIhAB8DSBNCvFHjo3kAJimvJwGY\nq795DMMEyzu/Zlhtgm3p2tp99Fhhs/J3WnroQwDcBmAkEW1W/l0OYBqAS4goHcBo5T1jQ+w2rGT8\ns2RnttUm2JaXJvZ0e19QbK/fRsAVBiHEKvjO/jlKX3MYK9h+WP5sh07i8Mliq02wLf3auYfTHLHZ\nd2nb1ZpKmw2FGP+US+Rqx8jDmswTVpsQFLYV9J+2HnF7/+lfLrDIEiYU7rnYvXjGt+sPWWQJw8iD\nbQU9Pds9aU+UBclHFm6TO1rUSDyrtnN0I8OEj20F/Z3l7iv5dWPNdWcCgN3Zhaaf0wgqLahb1qZx\nXbf3W7Pkzo1jJD0SGlltAhMh2FbQPUlpb3xusKt6t3F7X2HDIrJqTP890/RztvbIQTJvyxEfWzKB\nuLovJztjXEgj6GaUnuvpES3qOUqwK2bmnJad46dLTT/nea3MjZ5kIhdpBN0MZM0Rvv/EGbf3kwY5\nKxGZnmzjqSPGQljQg6BNk7qBN5KAS7o5KxGZnny1zvziFv1NmG60gjkb3GvN9jYwpbMssKAHwSXd\njMu3HkkMTW5utQm2ZWma+SmN6sV6xweeLrVXhKMa6/a5+4Bf3tOcLJ3xHus7dpqSZEEPAs6ax/hi\na9ZJXPTqr1abUc0sCf36zUoV7JljPu2ofbzZWKEYBq5I1XB84V9fvAeH8uwVJh7pZOYWub0f2bWl\nKeeNiXZfLMsrMn+hO1RsJ+iPz96CpKkLrDbDcE6dKcfjs7dw4iyTuOHDNej6zKKQ9/flwFrPgvgI\nAEjPsU+vEgAueeM3XPfBH25tqQfy3d5b5ZPwzNwdFp05eGwn6LNSswJvJAFvLtuDWalZ+HLtAeyR\nJIDJSvKKyiD8BFBtPHiy+vWhvDM4EaT7oa9jmyVCvRPdC6QXl9kr8jY95zTW788PvKEJhBNnV3G2\nEqeKy/UzJkhsJ+iRxtFTxgyzq26ql37ehUv/87vtsr6Fwq5jvrM+vrciAzdOXxPScfdkF6LfC0sw\n88+Dmra/6NXlGPDSMuw+VmiblARN6tZye//j5sgN1MopLInoB059DaMqIQT+9dNOrwXTzk8tRO/n\nF1
t2fbYS9Kd+2KbabqXbVmm5/yyBL/2chnd+Tcf2w+H5J5884/upnzR1Af4xe2tYx48ELn9rpWq7\nEAKvLtqNtZl5QR/z/5al45ftxwAAK/cc17zf2UqBMW/+jke/2wIA+Ns3mzBkWvCLnp1bmVNuz1+n\n8lRxOV5csNPQSkZbs05inZ/MhKNeX4F7v9wAABjw4jJc+c4qw2wJFy31dDOPF+GT1ftw5dursDrj\nODYezHerf/reigxLRta2EPT07EK8vng3vlqn3sN6YERnky06x0PfuP6IQ6b9iik1/qDL0rLx8ap9\nmP57Jl5bvAdXvO26gTcdzEeXpxcGjCj0F8T05748LNrunhjs21TfXg1FpRU+Q+s9j2MlvjIiz1i5\nz+c+WfnngqKKy87i3/N3VveOftuTizeW7MHrPsqIZeWf8Vqj8FyfWb/f9RD5cfMRv3nGfQ3TP73d\nnCygV/dt49WWdtQ14nl10S58tHIf5m4+XP3ZtqxT+GLNfuzNPY0J765GYYnvDsOtM9bhpZ/TAAD/\nmL0VSVMXVHdQhBA4Wylw1TurccP0tT6PsTe3CAu3H8Nz81zz0Rk5rp7t5kMnke9RNzQSqboPACC3\nsLQ6fXdR2VncMmMdJr73B36q8Rt7+9cMXPqf38PuyAWLLQT9po/W4W0/ZbVqx5h3GQkewUVblMjA\nwyeL3UTzzs9S8cL8nV77f7QyE6UVlZj0yZ8orXAflt3zRSqu/2ANkqYuwOwN6msFxWVncf2Ha/DX\nLzd6fZa6Pw+frPIWv8lfpGLK15tUoxjVjmMlyU/9XP06p6AEh/LOYHPWSbdtCkrKUVkpMG/LEQx9\nZTlWpbt63jNWZmLGqn146JtNKCwpx6RP/nTb74zH9MnQV5bj4v9dgU0Hfc/dCuH+0Hh5YRoqKwVy\nCkvcfL2Fjz6yWa52iXH1vNrGvrUSv+7KxqoM1/dztsYT88p3VuGZuTsw6vXfsPnQSfy2J7f6s7Sj\nBRBCoPxsJW74cA1WZRyvzvdT1XG44u1VOHG6FJP+ux6dnvwZ/qi5vvDpH/urXx85WYwJ765G3xeW\nBH/BJnPdB67pvryiMlzw4lK8smiXpv2+XGtuoFnAikWRQCC3oS6tzasiHygzYWWlwFYfT2UhRHVC\nrx1HCvDm0nQs3HYUj1zaBRsP5OOXHedKhxWWuPcci8sr8OXaA3j6x+0+z32tctPdMbSDW/vqDNdQ\neOr3W718bCON8rMC+UVlWL8/D5O/2OD1eV5RGfq9sAQT+yXg+42uHufDszZj/VOjUa4I1uKd2Zj4\n3h9e+/73Oj4zAAARfklEQVS+JxdCCBBRdc88t7AUV6tsW0VOYSmGvrK8+v2Hv2Xisu6tq/eZc+9g\n3P7fP73+XpHCHZ+mVr/eebQAw15drpqd8YGZmzCgQ1MMeHEZAGBcr3g8PqYL1u071zP17G32//dS\nn+d96odtWJt5Asv+PtznNoNVprBmrMzEyK4t0bFFZOanyT/jGk1oDSD7Zv0hvHh1TxSVVaBRnVqB\ndwgTWwh6oOJEzRrUNscQqA+ttxw614O847P1WLE713sjAB2ecO/JvL9iLwC4TdX4YuG2Y5jh0fs+\neaYMTep59wALSsrRsHYMiAiH8s71LnccKUBW/hlkF5SgVnQUOre09kdzUXJzrEz3ntf212Or8j6p\nEnPAJcr9XliC8+PPPdjTc9Sj+5am5eCSbq3Q7dlfQjXb7X685n3fD4NI4/M1rt7iwbwzqp9XiTkA\nLNh61EuAqqYN/fHywjQs3ZmNvYoP+ZZDJ312cNT494I0/HtBGlY+PsLrs/q1rZWrk2fKcOxUSdD7\nTVuYho9W7sP258eggcHXYAtBjyTOqih6zakWX2IeLp5iDrimddRu8l7PLcaQzs1wTb+2eGTWFrfP\navY2reb163pjwEvLAm9Yg192HFNtzysqqx6J+OPuz1Mxtkd4uWq0
ejBM6OM9r20UfTzcFvXga41e\nQTX58Df3VMzj310d0rkvetX9Pk1sWhd1alnj019Fn3+FNjX0kbIGdLqkggU90lDzN/5YRWzNYMMB\n33O/qzNOaBI4KwmlLOxri9UXOINh4Xb1h4JWbv14nabtzBQgLZ4ZdubjSfYvMelrnUVP5L4LDEHS\nHLoWYEWlJDMxI0e/U0i2eHpQD8y43VnQg6SZSV4LVvHJ7Smmnat1ozqBN7IxjeryAFgv+OGoDRb0\nIBnUqZnVJhjKiC7mJEACgKgouX+kj13axWoTmAjCjPEoC3qQ/PPKblabYCjcE9IP2ee1meAwI40E\n33FBwoLHMEwozPITza0XLOiMpdSK5gckE3kYkfa4MhS3riBhQWcsJaV9U6tNYBgv4lQC9sKlZiS4\nUdhe0M3M41KFZ+5pJnSmju1qtQnS0NzEiGnZad1Yfw+sYp5DD8z39w02/ZwvTuhh+jllJdaCBzLD\nyIrtf03xjesG3khnOraob/o5ZcXqcG6GMYvcQuNrk9pe0Bl7w0uiTCTSyaadtrAEnYguI6LdRJRB\nRFP1MiooG6w4KcNEIEYk6HIqj1xiz6CwkAWdiKIBvAtgLIBuAG4iIrmjbiTn/hGdTD+nrG799w03\n/7t85JLzTD+nrETbNIo5nB76AAAZQohMIUQZgG8AjNfHrMiGJB0XNKlrfp4aK9ZAzCCpmflD9m5t\nvAtXMM4iHEFPAFAz9ClLaZOeugYEHTgVWb1cxvWKt9oEJgzs+hs3/NdERJOJKJWIUnNzjSn+wOjD\n5SxCulGXvXdsjdGFKIwiHEE/DCCxxvu2SpsbQojpQogUIURKixYtwjidOvVq8w9HLzwLYDOhI3sm\nSTOx63y2FYQj6OsBJBNRByKKBXAjgHn6mKWd2jEs6AwjM89ewb4WWgl5XCGEqCCiBwD8AiAawCdC\niB26WcYwDAN5PaGMIKyJIiHEzwB+DrghwzBMiHRtzd47WpHTxYBhGGkY0MGajJyrp4605LzhYGtB\nH9RR7nJwTqGNAZntnEq3eO7N6oXeTgIp7eN0PZ4athb0p8adb7UJjA7cMbSD1SZIw5RRyVabwPjA\nDG8dWwu6le5M4/u0sezcsmFFVKWsXNajtdUmMD4wY3HX1oIeZeHyd8M69gw8iEQGdeKpM0Z+zEgZ\nYnNBt+7cHZo3sO7kksGBI4wTSIgzPnDP1oLeuF4tq02QhlaNrCtfxkUuGCcw+vxWhp/D1oLesiF7\nR+jFw6M59apeNKtvftZKxhj0FOF6JiT8srWgW4kQwmoTmAiFOLRRGrrFN9TtWLwoqqCW+WzOvYMs\nsOQcsi2Kto2rZ7UJ0sBLAvIwomtLq00IClsIeuO63nPlCU2sFaBr+ycG3shGtG/mPEFv3sCYqREr\nva8A4K0b+1h6fpno2y4Oaf+6TNO2kZDb33oLNKD2+2hm0I9RK8F6Zvx4/xCfn13eM3zf4aqCCrWi\nve3qrqGSTWJTeQX9yt7qMQM9Exobcj6rZ1zG97FHnZk/bBJaX1ujUI/r6buewHmtGqBfO44UBQB8\n+pcBuPsi92jCmAga1z55edeA2/RJbKL6pO/YvD7eu6V/WOffP20c3r25H/ZPG4dRXV2LOO/f0q/6\n8wVTLsLF5+mfi95orugVj0Z1YrD2iVFen03sq120ru7rEnTPxcqque5Jg9pj/7RxmPeA74euGs9f\n1b369Zd3Xlj92qgHhdVMGdlZ03bLHx2OFyb0wAMjfG8/+vxWaNOkLjo0PxdU1rttZH5v/nLbX9Ov\nbfXruHqx+PXvF6tut/jhi1HfhKIZthD0zi0b4Klx3fDDfYOr2yJh4SmhSV1cfF4LTB7Wye/Nu3/a\nOADBl7Va9+Qo/PbY8Or3m565pPr1xH7B9cI+npSCXS9oGzpGCu/c3A9bnxuD1h65Xq7t3xaDOzf3\nu2/NSN6RXVth/7RxuLrGQ+DB
kZ2rI1SvUrb1ldUv48Wx+GbyQLe2bycPxKTBSYiNdv2EhiafsycS\nojXvubij2/v908bhpgHtVLf11VP+8s4L8c3kgWhQOwaz/zoIdw3rqLpdTSYP64gOzevjtoHt8eiY\nLvjpgaFu611X9IrHqn+MwPu3ujocyx8dXj2CsioJVzjcNqh99esbBySiYwvv+JSmJno92Wplr2+7\nOMy5dzCy8s9YbQoA92xsAzs2wzvLMwAA/7isK+4d3gnL0rK9iiB3alEfe3OLqt97zl1P7JuAa/q3\nRVFpBVo1Oidk0VGEuPqxmP/gUFzx9ircOrA9vt/oVSAKAureNzHRUfBVC2SYzXrvFyTFuXkZrXh0\nOIa/tgIAMP/Boeih9JDnbj6iun+PhEZ4cGQyBASGJjdD//YuIYmNicKsewbhwIkixMZE4aFvNgNw\nfXcDPRLBXai8X//0aJSUnwUAvHZdbzz63Rac10o/z4hQeWLs+bg4uQVunrGuuu3liT3x8sSeSJq6\nwG3bFg3dYxBuSEnE2J6tqx9S258fU/3Z2zf1xYNfbwIA/PbYcBQUV+D+mRtxMM/1m/ScVuip9Lp/\nemAoVmUcx73DO3nZes+wjliZnosbB7TDRyv3IbFpXRzKKw710g3ltet64+jJYvRKbIKEJnXQueW5\nv7Xa3/23x4ab6l5tK0EHgP7t49DfhKxlwTI0uTn+Z1B7fL7mQHXbKBUf1hFdWmJv7j4AwCe3pyAl\nyb1X8tp1vb2GeF/fPRCJTV0Phh4Jjat7/P4INIC5Y0gHHMwrwtK0nIDHMpP1T43GxoP5mLnuoOrn\na54YidaN6mD2hiwAriFvUo1hew8N0x3jeydUL2CN7Or+NxrQoWl1T/GRWVsCHqtx3VrVi/bX9m+L\nkV1bmtoj80egUczPUy7CiaJS1IqOQurTo5Hy76UAXGUdh3dR9+64snebakFvr4xwfn98RPVDwlcv\nu2fbxtXi7kmPhMbY/OylAM6NZvOLytD3hSV+7beCa/u3DbxRDRrWqWVqwWnbCXok07ttEwAH0LGF\n72RTbWqk5PQUE0B9vs6IXCfPXtkNK3bnYGlaTkT51LdoWBtjurfGmO7q0xaeIx5//OeG3mjXNPTE\nXztq9ExrsuThYT73iRQxr+K2ge1xsrhc9bNuNRbLmzeojXdu7osHZm4KaYSx+OFhqBcbrdv1xynH\naRQh7sF1akWhpLxS9bPR57d06xitfHwELnp1uVmmuREZ35YkTOyXgC6tG/rtJd4+OAlfrD2Al67u\nGfb5frhvMPafKHJrC6TNvRObIL+oDEBkrENopV5sNM6Una1+73mZl/dsje5t3L/3q/u696YmD+uI\nnUcLNPeyPFMSzL1/CH7fk4vkCJhS0coLE3p4ta36x4jquf+ajOsZj1Z/rRNS3m4jppneurEP+iZG\nxmh80zOX+pzOfP/W/iguP3dvJjath7h6tZB/Rv1BaiQs6DpCRAGH/FFRhOWPDvdq3/jMJSirUO8B\n+KJvuzj09ZizrK2IUHSU+nr33Bruk1WeQpGQS2XFo8NRftb39a99chTKa3w/VYUchp3nmlbQ4inU\nslEdzLx7YMDtfNE7sQl6JzYJef9IwVcQGRHhgiRtC5NqsSF6E0nul/6mTWpFR6GWxwPy5Yk98fLC\nXaaPMMjM4XZKSopITU017XxOJL+oDNNXZuLRS7sgPacQa/eewO1D1AtIVFYKvLksHZMGtUezBtYl\n5wqVgpJyNKrDCdrMJiOnEHH1Ym15z9gVItoghEgJuB0LOsMwTGSjVdBt4YfOMAzDBIYFnWEYRhJY\n0BmGYSSBBZ1hGEYSWNAZhmEkgQWdYRhGEljQGYZhJIEFnWEYRhJMDSwiolwABwJuqE5zAMd1NMcO\n8DU7A75mZxDONbcXQgTMc22qoIcDEaVqiZSSCb5mZ8DX7AzMuGaecmEYhpEEFnSGYRhJsJOgT7
fa\nAAvga3YGfM3OwPBrts0cOsMwDOMfO/XQGYZhGD/YQtCJ6DIi2k1EGUQ01Wp7goGIEoloORHtJKId\nRPSQ0t6UiJYQUbryf1yNfZ5QrnU3EY2p0d6fiLYpn/0fKTXkiKg2EX2rtK8joiSzr9MTIoomok1E\nNF95L/X1AgARNSGi2US0i4jSiGiQzNdNRA8r9/R2IvqaiOrIeL1E9AkR5RDR9hptplwnEU1SzpFO\nRJMCGiuEiOh/AKIB7AXQEUAsgC0AulltVxD2xwPop7xuCGAPgG4AXgUwVWmfCuAV5XU35RprA+ig\nXHu08tmfAAYCIAALAYxV2u8D8IHy+kYA30bAdT8CYCaA+cp7qa9XseUzAHcpr2MBNJH1ugEkANgH\noK7yfhaA22W8XgDDAPQDsL1Gm+HXCaApgEzl/zjldZxfW63+EWj4MgcB+KXG+ycAPGG1XWFcz1wA\nlwDYDSBeaYsHsFvt+gD8onwH8QB21Wi/CcCHNbdRXsfAFbxAFl5jWwDLAIzEOUGX9noVOxrDJXDk\n0S7ldcMl6IcUsYkBMB/ApRJfbxLcBd3w66y5jfLZhwBu8menHaZcqm6cKrKUNtuhDKX6AlgHoJUQ\n4qjy0TEArZTXvq43QXnt2e62jxCiAsApAM10vwDtvAngcQA1qz7LfL2AqzeWC+C/ylTTDCKqD0mv\nWwhxGMBrAA4COArglBBiMSS9XhXMuM6gtc8Ogi4FRNQAwBwAfxNCFNT8TLgev1K4GxHRFQByhBAb\nfG0j0/XWIAauYfn7Qoi+AIrgGopXI9N1K3PG4+F6kLUBUJ+Ibq25jUzX649Iuk47CPphAIk13rdV\n2mwDEdWCS8y/EkJ8rzRnE1G88nk8gByl3df1HlZee7a77UNEMXAN/0/ofyWaGALgKiLaD+AbACOJ\n6EvIe71VZAHIEkKsU97PhkvgZb3u0QD2CSFyhRDlAL4HMBjyXq8nZlxn0NpnB0FfDyCZiDoQUSxc\niwbzLLZJM8pK9scA0oQQb9T4aB6AqlXrSXDNrVe136isfHcAkAzgT2V4V0BEA5Vj/o/HPlXHuhbA\nr0qvwXSEEE8IIdoKIZLg+lv9KoS4FZJebxVCiGMADhFRF6VpFICdkPe6DwIYSET1FDtHAUiDvNfr\niRnX+QuAS4koThkRXaq0+caKBYYQFiQuh8s7ZC+Ap6y2J0jbh8I1HNsKYLPy73K45siWAUgHsBRA\n0xr7PKVc624oK+FKewqA7cpn7+BcYFgdAN8ByIBrJb2j1det2DUc5xZFnXC9fQCkKn/rH+HyTJD2\nugE8D2CXYusXcHl2SHe9AL6Ga52gHK6R2J1mXSeAO5T2DAB/CWQrR4oyDMNIgh2mXBiGYRgNsKAz\nDMNIAgs6wzCMJLCgMwzDSAILOsMwjCSwoDMMw0gCCzrDMIwksKAzDMNIwv8DX9Y35IpeRYwAAAAA\nSUVORK5CYII=\n",
"text/plain": [
"<matplotlib.figure.Figure at 0x10e075748>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"%matplotlib inline\n",
"from matplotlib import pyplot as plt\n",
"plt.plot(waveform)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: 4 - Writing CUDA Kernels.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GTC 2017 Numba Tutorial Notebook 4: Writing CUDA Kernels\n",
"\n",
"## The CUDA Programming Model\n",
"\n",
"Ufuncs (and generalized ufuncs mentioned in the bonus notebook at the end of the tutorial) are the easiest way in Numba to use the GPU, and present an abstraction that requires minimal understanding of the CUDA programming model. However, not all functions can be written as ufuncs. Many problems require greater flexibility, in which case you want to write a *CUDA kernel*, the topic of this notebook. \n",
"\n",
"Fully explaining the CUDA programming model is beyond the scope of this tutorial. We highly recommend that everyone writing CUDA kernels with Numba take the time to read Chapters 1 and 2 of the CUDA C Programming Guide:\n",
"\n",
" * Introduction: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction\n",
" * Programming Model: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model\n",
"\n",
"The programming model chapter gets a little into C specifics, but familiarity with CUDA C can help you write better CUDA kernels in Python.\n",
"\n",
"For the purposes of this tutorial, the most important thing is to understand this diagram:\n",
"*(figure: the CUDA thread hierarchy of threads, blocks, and the grid)*\n",
"\n",
"We will be writing a *kernel* that describes the execution of a single thread in this hierarchy. The CUDA compiler and driver will execute our kernel across a *thread grid* that is divided into *blocks* of threads. Threads within the same block can exchange data very easily during the execution of a kernel, whereas threads in different blocks should generally not communicate with each other (with a few exceptions).\n",
"\n",
"Deciding the best size for the CUDA thread grid is a complex problem (and depends on both the algorithm and the specific GPU compute capability), but here are some very rough heuristics that we follow:\n",
"\n",
" * the size of a block should be a multiple of 32 threads, with typical block sizes between 128 and 512 threads per block.\n",
" * the size of the grid should ensure the full GPU is utilized where possible. Launching a grid where the number of blocks is 2x-4x the number of \"multiprocessors\" on the GPU is a good starting place; something in the range of 20 - 100 blocks is usually reasonable.\n",
" * the CUDA kernel launch overhead does depend on the number of blocks, so when the input size is very large we find it best not to launch one thread per input element. We'll show a pattern for dealing with large inputs below.\n",
"\n",
"Each thread distinguishes itself from the other threads using its unique thread (`threadIdx`) and block (`blockIdx`) index values, which can be multidimensional if launched that way."
]
},
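The index arithmetic just described can be sketched in plain Python (the function below is illustrative, not part of the Numba API): a thread's global index is its position within its block plus the combined size of all preceding blocks.

```python
def global_index(thread_idx, block_idx, block_dim):
    """Global 1D index of a thread: the offset of its block plus its position inside it."""
    return thread_idx + block_idx * block_dim

# Thread 5 in block 2, with 128 threads per block, handles element 261.
print(global_index(5, 2, 128))  # 261
```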
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## A First Example\n",
"\n",
"This all will be a little overwhelming at first, so let's start with a concrete example. Let's write our addition function for 1D NumPy arrays. CUDA kernels are compiled using the `numba.cuda.jit` decorator (not to be confused with the `numba.jit` decorator for the CPU):"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from numba import cuda\n",
"\n",
"@cuda.jit\n",
"def add_kernel(x, y, out):\n",
" tx = cuda.threadIdx.x # this is the unique thread ID within a 1D block\n",
" ty = cuda.blockIdx.x # Similarly, this is the unique block ID within the 1D grid\n",
"\n",
" block_size = cuda.blockDim.x # number of threads per block\n",
" grid_size = cuda.gridDim.x # number of blocks in the grid\n",
" \n",
" start = tx + ty * block_size\n",
" stride = block_size * grid_size\n",
"\n",
" # assuming x and y inputs are same length\n",
" for i in range(start, x.shape[0], stride):\n",
" out[i] = x[i] + y[i]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"That's a lot more typing than our ufunc example, and it is much more limited: it only works on 1D arrays, it doesn't verify that the input sizes match, etc. Most of the function is spent figuring out how to turn the block and grid indices and dimensions into unique offsets into the input arrays. The pattern of computing a starting index and a stride is a common way to ensure that your grid size is independent of the input size. The striding will maximize bandwidth by ensuring that threads with consecutive indices are accessing consecutive memory locations as much as possible. Threads whose starting index is beyond the length of the input (`x.shape[0]`, since `x` is a NumPy array) simply skip the for loop.\n",
"\n",
"Also note that we did not need to specify a type signature for the CUDA kernel. Unlike `@vectorize`, Numba can infer the type signature from the inputs automatically at call time.\n",
"\n",
"Let's call the function now on some data:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[ 0. 3. 6. 9. 12. 15. 18. 21. 24. 27.]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"n = 100000\n",
"x = np.arange(n).astype(np.float32)\n",
"y = 2 * x\n",
"out = np.empty_like(x)\n",
"\n",
"threads_per_block = 128\n",
"blocks_per_grid = 30\n",
"\n",
"add_kernel[blocks_per_grid, threads_per_block](x, y, out)\n",
"print(out[:10])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The unusual syntax for calling the kernel function is designed to mimic the CUDA Runtime API in C, where the above call would look like:\n",
"```\n",
"add_kernel<<<blocks_per_grid, threads_per_block>>>(x, y, out)\n",
"```\n",
"The arguments within the square brackets define the size and shape of the thread grid, and the arguments within parentheses correspond to the kernel function arguments.\n",
"\n",
"Note that, unlike the ufunc, the arguments are passed to the kernel as full NumPy arrays. The kernel can access any element in the array it wants, regardless of its position in the thread grid. This is why CUDA kernels are significantly more powerful than ufuncs. (But with great power comes a greater amount of typing...)\n",
"\n",
"Numba includes [several helper functions](http://numba.pydata.org/numba-doc/dev/cuda/kernels.html#absolute-positions) to simplify the thread offset calculations above. You can write the function much more simply as:"
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"@cuda.jit\n",
"def add_kernel(x, y, out):\n",
" start = cuda.grid(1) # 1 = one dimensional thread grid, returns a single value\n",
" stride = cuda.gridsize(1) # ditto\n",
"\n",
" # assuming x and y inputs are same length\n",
" for i in range(start, x.shape[0], stride):\n",
" out[i] = x[i] + y[i]"
]
},
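To see why the grid-stride pattern in this kernel covers every element exactly once, here is a plain-Python emulation (a sketch only; the outer loop over `start` values stands in for the parallel threads):

```python
n = 1000          # input length
grid_size = 128   # total threads = blocks_per_grid * threads_per_block

covered = [0] * n
for start in range(grid_size):           # one iteration per simulated thread
    for i in range(start, n, grid_size):
        covered[i] += 1                  # this thread's share of the work

# Every element is visited exactly once, regardless of n vs. grid_size.
print(all(c == 1 for c in covered))  # True
```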
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As before, using NumPy arrays forces Numba to allocate GPU memory, copy the arguments to the GPU, run the kernel, then copy the argument arrays back to the host. This is not very efficient, so you will often want to allocate device arrays:\n",
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"x_device = cuda.to_device(x)\n",
"y_device = cuda.to_device(y)\n",
"out_device = cuda.device_array_like(x)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 44.29 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"100 loops, best of 3: 2.41 ms per loop\n"
]
}
],
"source": [
"%timeit add_kernel[blocks_per_grid, threads_per_block](x, y, out)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"The slowest run took 203.24 times longer than the fastest. This could mean that an intermediate result is being cached.\n",
"1000 loops, best of 3: 492 µs per loop\n"
]
}
],
"source": [
"%timeit add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device); out_device.copy_to_host()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kernel Synchronization\n",
"\n",
"*One extremely important caveat should be mentioned here*: CUDA kernel execution is designed to be asynchronous with respect to the host program. This means that the kernel launch (`add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device)`) returns immediately, allowing the CPU to continue executing while the GPU works in the background. Only host<->device memory copies or an explicit synchronization call will force the CPU to wait until previously queued CUDA kernels are complete.\n",
"\n",
"When you pass host NumPy arrays to a CUDA kernel, Numba has to synchronize on your behalf, but if you pass device arrays, processing will continue. If you launch multiple kernels in sequence without any synchronization in between, they will be queued up to run sequentially by the driver, which is usually what you want. If you want to run multiple kernels on the GPU in parallel (sometimes a good idea, but beware of race conditions!), take a look at [CUDA streams](http://numba.pydata.org/numba-doc/dev/cuda-reference/host.html?highlight=synchronize#stream-management).\n",
"\n",
"Here are some sample timings (using `%time`, which only runs the statement once to ensure our measurement isn't affected by the finite depth of the CUDA kernel queue):\n",
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 2.53 ms, sys: 1.11 ms, total: 3.64 ms\n",
"Wall time: 2.47 ms\n"
]
}
],
"source": [
"# CPU input/output arrays, implied synchronization for memory copies\n",
"%time add_kernel[blocks_per_grid, threads_per_block](x, y, out)"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 467 µs, sys: 64 µs, total: 531 µs\n",
"Wall time: 489 µs\n"
]
}
],
"source": [
"# GPU input/output arrays, no synchronization (but force sync before and after)\n",
"cuda.synchronize()\n",
"%time add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device)\n",
"cuda.synchronize()"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU times: user 1.27 ms, sys: 546 µs, total: 1.82 ms\n",
"Wall time: 1.01 ms\n"
]
}
],
"source": [
"# GPU input/output arrays, include explicit synchronization in timing\n",
"cuda.synchronize()\n",
"%time add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device); cuda.synchronize()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Always be sure to synchronize with the GPU when benchmarking CUDA kernels!**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Atomic Operations and Avoiding Race Conditions\n",
"\n",
"CUDA, like many general purpose parallel execution frameworks, makes it possible to have race conditions in your code. A race condition in CUDA arises when threads read or write a memory location that might be modified by another independent thread. Generally speaking, you need to worry about:\n",
"\n",
" * read-after-write hazards: One thread is reading a memory location at the same time another thread might be writing to it.\n",
" * write-after-write hazards: Two threads are writing to the same memory location, and only one write will be visible when the kernel is complete.\n",
" \n",
"A common strategy to avoid both of these hazards is to organize your CUDA kernel algorithm such that each thread has exclusive responsibility for unique subsets of output array elements, and/or to never use the same array for both input and output in a single kernel call. (Iterative algorithms can use a double-buffering strategy if needed, and switch input and output arrays on each iteration.)\n",
"\n",
"However, there are many cases where different threads need to combine results. Consider something very simple, like: \"every thread increments a global counter.\" Implementing this in your kernel requires each thread to:\n",
"\n",
"1. Read the current value of a global counter.\n",
"2. Compute `counter + 1`.\n",
"3. Write that value back to global memory.\n",
"\n",
"However, there is no guarantee that another thread has not changed the global counter between steps 1 and 3. To resolve this problem, CUDA provides \"atomic operations\" which will read, modify and update a memory location in one, indivisible step. Numba supports several of these functions, [described here](http://numba.pydata.org/numba-doc/dev/cuda/intrinsics.html#supported-atomic-operations).\n",
"\n",
"Let's make our thread counter kernel:"
]
},
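The same read-modify-write hazard can be mimicked on the CPU with Python threads. The sketch below uses a `threading.Lock` to make the increment indivisible; this is only an analogy for what `cuda.atomic.add` guarantees, not how the GPU implements it:

```python
import threading

counter = [0]          # stand-in for the global counter array
lock = threading.Lock()

def safe_increment(n):
    for _ in range(n):
        with lock:     # read + add + write happen as one indivisible step
            counter[0] += 1

threads = [threading.Thread(target=safe_increment, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter[0])  # 40000: every increment survives
```

Without the lock, some increments from different threads overlap and are lost, just like the racy CUDA version below.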
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"@cuda.jit\n",
"def thread_counter_race_condition(global_counter):\n",
" global_counter[0] += 1 # This is bad\n",
" \n",
"@cuda.jit\n",
"def thread_counter_safe(global_counter):\n",
" cuda.atomic.add(global_counter, 0, 1) # Safely add 1 to offset 0 in global_counter array"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Should be 4096: [3]\n"
]
}
],
"source": [
"# This gets the wrong answer\n",
"global_counter = cuda.to_device(np.array([0], dtype=np.int32))\n",
"thread_counter_race_condition[64, 64](global_counter)\n",
"\n",
"print('Should be %d:' % (64*64), global_counter.copy_to_host())"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Should be 4096: [4096]\n"
]
}
],
"source": [
"# This works correctly\n",
"global_counter = cuda.to_device(np.array([0], dtype=np.int32))\n",
"thread_counter_safe[64, 64](global_counter)\n",
"\n",
"print('Should be %d:' % (64*64), global_counter.copy_to_host())"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## Exercise\n",
"\n",
"For this exercise, create a histogramming kernel. This will take an array of input data, a range and a number of bins, and count how many of the input data elements land in each bin. Below is an example CPU implementation of histogramming:"
]
},
{
"cell_type": "code",
"execution_count": 50,
"metadata": {},
"outputs": [],
"source": [
"def cpu_histogram(x, xmin, xmax, histogram_out):\n",
" '''Increment bin counts in histogram_out, given histogram range [xmin, xmax).'''\n",
" # Note that we don't have to pass in nbins explicitly, because the size of histogram_out determines it\n",
" nbins = histogram_out.shape[0]\n",
" bin_width = (xmax - xmin) / nbins\n",
" \n",
" # This is a very slow way to do this with NumPy, but looks similar to what you will do on the GPU\n",
" for element in x:\n",
" bin_number = np.int32((element - xmin)/bin_width)\n",
" if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
" # only increment if in range\n",
" histogram_out[bin_number] += 1"
]
},
{
"cell_type": "code",
"execution_count": 56,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 5, 86, 494, 1562, 2891, 2879, 1527, 462, 88, 5], dtype=int32)"
]
},
"execution_count": 56,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"x = np.random.normal(size=10000, loc=0, scale=1).astype(np.float32)\n",
"xmin = np.float32(-4.0)\n",
"xmax = np.float32(4.0)\n",
"histogram_out = np.zeros(shape=10, dtype=np.int32)\n",
"\n",
"cpu_histogram(x, xmin, xmax, histogram_out)\n",
"\n",
"histogram_out"
]
},
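While developing your kernel, `np.histogram` with the same bin count and range makes a convenient cross-check for the CPU reference (a sketch with a self-contained copy of `cpu_histogram`; note that edge handling can differ slightly at exact bin boundaries, and `np.histogram` includes `xmax` in its last bin):

```python
import numpy as np

def cpu_histogram(x, xmin, xmax, histogram_out):
    '''Reference CPU histogram, same logic as above.'''
    nbins = histogram_out.shape[0]
    bin_width = (xmax - xmin) / nbins
    for element in x:
        bin_number = np.int32((element - xmin) / bin_width)
        if 0 <= bin_number < nbins:
            histogram_out[bin_number] += 1

# Uniform samples strictly inside the range, so every sample lands in a bin
x = np.random.RandomState(0).uniform(-3, 3, 10000).astype(np.float32)
ours = np.zeros(10, dtype=np.int32)
cpu_histogram(x, np.float32(-4.0), np.float32(4.0), ours)

ref, _ = np.histogram(x, bins=10, range=(-4.0, 4.0))
print(ours.sum(), ref.sum())  # both sums are 10000: every sample counted once
```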
{
"cell_type": "code",
"execution_count": 57,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"@cuda.jit\n",
"def cuda_histogram(x, xmin, xmax, histogram_out):\n",
" '''Increment bin counts in histogram_out, given histogram range [xmin, xmax).'''\n",
" \n",
" pass # Replace this with your implementation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
================================================
FILE: 5 - Troubleshooting and Debugging.ipynb
================================================
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# GTC 2017 Numba Tutorial Notebook 5: Troubleshooting and Debugging\n",
"\n",
"## Note about the Terminal\n",
"\n",
"Debugging is an important part of programming. Unfortunately, it is pretty difficult to debug CUDA kernels directly in the Jupyter notebook for a variety of reasons, so this notebook will show terminal commands by executing Jupyter notebook cells with the shell. These shell commands appear in notebook cells with the command line prefixed by `!`. When applying the debug methods described in this notebook, you will likely run the commands in the terminal directly.\n",
"\n",
"## Printing\n",
"\n",
"A common debugging strategy is printing to the console. Numba supports printing from CUDA kernels, with some restrictions. Note that output printed from a CUDA kernel will not be captured by Jupyter, so you will need to debug with a script you can run from the terminal.\n",
"\n",
"Let's look at a CUDA kernel with a bug:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"import numpy as np\r\n",
"\r\n",
"from numba import cuda\r\n",
"\r\n",
"@cuda.jit\r\n",
"def histogram(x, xmin, xmax, histogram_out):\r\n",
" nbins = histogram_out.shape[0]\r\n",
" bin_width = (xmax - xmin) / nbins\r\n",
"\r\n",
" start = cuda.grid(1)\r\n",
" stride = cuda.gridsize(1)\r\n",
"\r\n",
" for i in range(start, x.shape[0], stride):\r\n",
" bin_number = np.int32((x[i] - xmin)/bin_width)\r\n",
" if bin_number >= 0 and bin_number < histogram_out.shape[0]:\r\n",
" histogram_out[bin_number] += 1\r\n",
"\r\n",
"x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
"xmin = np.float32(-4.0)\r\n",
"xmax = np.float32(4.0)\r\n",
"histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
"\r\n",
"histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
"\r\n",
"print('input count:', x.shape[0])\r\n",
"print('histogram:', histogram_out)\r\n",
"print('count:', histogram_out.sum())\r\n"
]
}
],
"source": [
"! cat debug/ex1.py"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"When we run this code to histogram 50 values, we see the histogram is not getting 50 entries: "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"input count: 50\r\n",
"histogram: [0 0 1 1 1 1 1 1 0 0]\r\n",
"count: 6\r\n"
]
}
],
"source": [
"! python debug/ex1.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"*(You might have already spotted the mistake, but let's pretend we don't know the answer.)*\n",
"\n",
"We hypothesize that maybe a bin calculation error is causing many of the histogram entries to appear out of range. Let's add some printing around the `if` statement to show us what is going on:"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"import numpy as np\r\n",
"\r\n",
"from numba import cuda\r\n",
"\r\n",
"@cuda.jit\r\n",
"def histogram(x, xmin, xmax, histogram_out):\r\n",
" nbins = histogram_out.shape[0]\r\n",
" bin_width = (xmax - xmin) / nbins\r\n",
"\r\n",
" start = cuda.grid(1)\r\n",
" stride = cuda.gridsize(1)\r\n",
"\r\n",
" for i in range(start, x.shape[0], stride):\r\n",
" bin_number = np.int32((x[i] - xmin)/bin_width)\r\n",
" if bin_number >= 0 and bin_number < histogram_out.shape[0]:\r\n",
" histogram_out[bin_number] += 1\r\n",
" print('in range', x[i], bin_number)\r\n",
" else:\r\n",
" print('out of range', x[i], bin_number)\r\n",
"\r\n",
"x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
"xmin = np.float32(-4.0)\r\n",
"xmax = np.float32(4.0)\r\n",
"histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
"\r\n",
"histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
"\r\n",
"print('input count:', x.shape[0])\r\n",
"print('histogram:', histogram_out)\r\n",
"print('count:', histogram_out.sum())\r\n"
]
}
],
"source": [
"! cat debug/ex1a.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This kernel will print every value and bin number it calculates. Looking at one of the print statements, we see that `print` supports constant strings and scalar values:\n",
"\n",
"``` python\n",
"print('in range', x[i], bin_number)\n",
"```\n",
"\n",
"String substitution (using C printf syntax or the newer `format()` syntax) is not supported. If we run this script we see:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"in range 1.674757 7\r\n",
"in range -0.492113 4\r\n",
"in range 0.526627 5\r\n",
"in range 2.359267 7\r\n",
"in range -1.768394 2\r\n",
"in range 0.342256 5\r\n",
"in range 0.793954 5\r\n",
"in range -0.338127 4\r\n",
"in range 1.275327 6\r\n",
"in range -0.877891 3\r\n",
"in range 0.922818 6\r\n",
"in range 0.635215 5\r\n",
"in range 0.371592 5\r\n",
"in range 0.925639 6\r\n",
"in range -1.116025 3\r\n",
"in range 0.615792 5\r\n",
"in range 0.879030 6\r\n",
"in range 2.061845 7\r\n",
"in range 0.037717 5\r\n",
"in range -0.440858 4\r\n",
"in range 1.056680 6\r\n",
"in range -0.111198 4\r\n",
"in range 0.452880 5\r\n",
"in range -0.154099 4\r\n",
"in range 0.518296 5\r\n",
"in range 0.072946 5\r\n",
"in range 1.209770 6\r\n",
"in range -0.057651 4\r\n",
"in range 0.154896 5\r\n",
"in range 1.099341 6\r\n",
"in range 0.271862 5\r\n",
"in range 0.643499 5\r\n",
"in range 0.824574 6\r\n",
"in range 0.809260 6\r\n",
"in range 0.354412 5\r\n",
"in range -0.365111 4\r\n",
"in range 0.594393 5\r\n",
"in range 0.830470 6\r\n",
"in range -0.402743 4\r\n",
"in range -0.554546 4\r\n",
"in range -0.507898 4\r\n",
"in range -0.006359 4\r\n",
"in range -0.316683 4\r\n",
"in range 2.015556 7\r\n",
"in range -1.288521 3\r\n",
"in range 0.401858 5\r\n",
"in range -1.410364 3\r\n",
"in range -0.459540 4\r\n",
"in range 0.633090 5\r\n",
"in range -1.000390 3\r\n",
"input count: 50\r\n",
"histogram: [0 0 1 2 2 2 2 2 0 0]\r\n",
"count: 11\r\n"
]
}
],
"source": [
"! python debug/ex1a.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Scanning down that output, we see that all 50 values should be in range. Clearly we have some kind of race condition updating the histogram. In fact, the culprit line is:\n",
"\n",
"``` python\n",
"histogram_out[bin_number] += 1\n",
"```\n",
"\n",
"which should be (as you may have seen in a previous exercise)\n",
"\n",
"``` python\n",
"cuda.atomic.add(histogram_out, bin_number, 1)\n",
"```"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## CUDA Simulator\n",
"\n",
"Back in the early days of CUDA, `nvcc` had an \"emulator\" mode that would execute CUDA code on the CPU. That functionality was dropped in later CUDA releases after `cuda-gdb` was created. We missed emulator mode so much that Numba includes a \"CUDA simulator\" that runs your CUDA code with the Python interpreter on the host CPU. This allows you to debug the logic of your code using Python modules and functions that would otherwise not be allowed by the compiler.\n",
"\n",
"A very common use case is to start the Python debugger inside one thread of a CUDA kernel:\n",
"``` python\n",
"import numpy as np\n",
"\n",
"from numba import cuda\n",
"\n",
"@cuda.jit\n",
"def histogram(x, xmin, xmax, histogram_out):\n",
" nbins = histogram_out.shape[0]\n",
" bin_width = (xmax - xmin) / nbins\n",
"\n",
" start = cuda.grid(1)\n",
" stride = cuda.gridsize(1)\n",
"\n",
" ### DEBUG FIRST THREAD\n",
" if start == 0:\n",
" from pdb import set_trace; set_trace()\n",
" ###\n",
"\n",
" for i in range(start, x.shape[0], stride):\n",
" bin_number = np.int32((x[i] + xmin)/bin_width)\n",
"\n",
" if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
" cuda.atomic.add(histogram_out, bin_number, 1)\n",
"\n",
"x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\n",
"xmin = np.float32(-4.0)\n",
"xmax = np.float32(4.0)\n",
"histogram_out = np.zeros(shape=10, dtype=np.int32)\n",
"\n",
"histogram[64, 64](x, xmin, xmax, histogram_out)\n",
"\n",
"print('input count:', x.shape[0])\n",
"print('histogram:', histogram_out)\n",
"print('count:', histogram_out.sum())\n",
"```\n",
"\n",
"This code allows a debug session like the following to take place:\n",
"```\n",
"(gtc2017) 0179-sseibert:gtc2017-numba sseibert$ NUMBA_ENABLE_CUDASIM=1 python debug/ex2.py\n",
"> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(18)histogram()\n",
"-> for i in range(start, x.shape[0], stride):\n",
"(Pdb) n\n",
"> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(19)histogram()\n",
"-> bin_number = np.int32((x[i] + xmin)/bin_width)\n",
"(Pdb) n\n",
"> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(21)histogram()\n",
"-> if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n",
"(Pdb) p bin_number, x[i]\n",
"(-6, -1.4435024)\n",
"(Pdb) p x[i], xmin, bin_width\n",
"(-1.4435024, -4.0, 0.80000000000000004)\n",
"(Pdb) p (x[i] - xmin) / bin_width\n",
"3.1956219673156738\n",
"(Pdb) q\n",
"```"
]
},
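{
"cell_type": "markdown",
"metadata": {},
"source": [
"The session above pinpoints the bug: the kernel computes `(x[i] + xmin)/bin_width` where it should compute `(x[i] - xmin)/bin_width`. With `xmin = -4.0`, adding instead of subtracting sends the value `-1.4435024` to bin `-6`, while the corrected expression gives about `3.2`, i.e. bin 3. A quick check with the numbers printed in the pdb session:\n",
"\n",
"``` python\n",
"x_i, xmin, bin_width = -1.4435024, -4.0, 0.8\n",
"int((x_i + xmin) / bin_width)  # -6 (buggy)\n",
"int((x_i - xmin) / bin_width)  # 3 (fixed)\n",
"```"
]
},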
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"## CUDA Memcheck\n",
"\n",
"Another common error occurs when a CUDA kernel has an invalid memory access, typically caused by running off the end of an array. The full CUDA toolkit from NVIDIA (not the `cudatoolkit` conda package) contain a utility called `cuda-memcheck` that can check for a wide range of memory access mistakes in CUDA code.\n",
"\n",
"Let's debug the following code:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"import numpy as np\r\n",
"\r\n",
"from numba import cuda\r\n",
"\r\n",
"@cuda.jit\r\n",
"def histogram(x, xmin, xmax, histogram_out):\r\n",
" nbins = histogram_out.shape[0]\r\n",
" bin_width = (xmax - xmin) / nbins\r\n",
"\r\n",
" start = cuda.grid(1)\r\n",
" stride = cuda.gridsize(1)\r\n",
"\r\n",
" for i in range(start, x.shape[0], stride):\r\n",
" bin_number = np.int32((x[i] + xmin)/bin_width)\r\n",
"\r\n",
" if bin_number >= 0 or bin_number < histogram_out.shape[0]:\r\n",
" cuda.atomic.add(histogram_out, bin_number, 1)\r\n",
"\r\n",
"x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n",
"xmin = np.float32(-4.0)\r\n",
"xmax = np.float32(4.0)\r\n",
"histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n",
"\r\n",
"histogram[64, 64](x, xmin, xmax, histogram_out)\r\n",
"\r\n",
"print('input count:', x.shape[0])\r\n",
"print('histogram:', histogram_out)\r\n",
"print('count:', histogram_out.sum())\r\n"
]
}
],
"source": [
"! cat debug/ex3.py"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"========= CUDA-MEMCHECK\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (49,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (48,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (47,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (46,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (45,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (44,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (43,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (42,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (41,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (40,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (39,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (38,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (37,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (36,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (35,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (34,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (33,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (32,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (31,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (30,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (29,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (28,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (27,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (26,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (25,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (24,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (23,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (22,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (21,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (20,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (19,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (18,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (17,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (16,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (15,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (14,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (13,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (12,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (11,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (10,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (9,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (8,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (7,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (6,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (5,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (4,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (3,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (2,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (1,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (0,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to \"unspecified launch failure\" on CUDA API call to cuMemcpyDtoH_v2. \n",
"========= Saved host backtrace up to driver entry point at error\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuMemcpyDtoH_v2 + 0x184) [0x158214]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"========= Host Frame:[0x7fff5b7db780]\n",
"=========\n",
"Traceback (most recent call last):\n",
" File \"debug/ex3.py\", line 24, in <module>\n",
" histogram[64, 64](x, xmin, xmax, histogram_out)\n",
" File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 703, in __call__\n",
" cfg(*args)\n",
" File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 483, in __call__\n",
" sharedmem=self.sharedmem)\n",
" File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 585, in _kernel_call\n",
" wb()\n",
" File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 600, in <lambda>\n",
" retr.append(lambda: devary.copy_to_host(val, stream=stream))\n",
" File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py\", line 198, in copy_to_host\n",
" _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)\n",
" File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 1484, in device_to_host\n",
" fn(host_pointer(dst), device_pointer(src), size, *varargs)\n",
" File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 262, in safe_cuda_api_call\n",
" self._check_error(fname, retcode)\n",
" File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 299, in _check_error\n",
" raise CudaAPIError(retcode, msg)\n",
"numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"========= ERROR SUMMARY: 51 errors\r\n"
]
}
],
"source": [
"! cuda-memcheck python debug/ex3.py"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The output of `cuda-memcheck` clearly shows a problem with our histogram function:\n",
"```\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000460 in cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"```\n",
"However, it does not tell us which line of our Python source caused the invalid write. To get better error information, we can turn on \"debug\" mode when compiling the kernel by changing the decorator like this:\n",
"``` python\n",
"@cuda.jit(debug=True)\n",
"def histogram(x, xmin, xmax, histogram_out):\n",
" nbins = histogram_out.shape[0]\n",
"```"
]
},
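{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an aside, an \"Invalid __global__ write\" at an address just past the end of an array usually means the computed index can fall outside the valid range — here, a bin index outside `[0, nbins)` for values at or beyond the histogram's edges. A minimal sketch of a guarded kernel is shown below (this assumes the tutorial's usual bin arithmetic and is an illustration, not necessarily identical to `debug/ex3a.py`):\n",
"``` python\n",
"@cuda.jit\n",
"def histogram(x, xmin, xmax, histogram_out):\n",
"    nbins = histogram_out.shape[0]\n",
"    bin_width = (xmax - xmin) / nbins\n",
"    start = cuda.grid(1)\n",
"    stride = cuda.gridsize(1)\n",
"    for i in range(start, x.shape[0], stride):\n",
"        bin_number = np.int32((x[i] - xmin) / bin_width)\n",
"        # Guard against out-of-range bins before writing\n",
"        if bin_number >= 0 and bin_number < nbins:\n",
"            cuda.atomic.add(histogram_out, bin_number, 1)\n",
"```"
]
},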
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"========= CUDA-MEMCHECK\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (49,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (48,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (47,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (46,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (45,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e4 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (44,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (43,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (42,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (41,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (40,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (39,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (38,0,0) in block (0,0,0)\n",
"========= Address 0x700a601ec is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (37,0,0) in block (0,0,0)\n",
"========= Address 0x700a601e8 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (36,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",
"=========\n",
"========= Invalid __global__ write of size 4\n",
"========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array<float, int=1, C, mutable, aligned>, float, float, Array<int, int=1, C, mutable, aligned>)\n",
"========= by thread (35,0,0) in block (0,0,0)\n",
"========= Address 0x700a601f0 is out of bounds\n",
"========= Saved host backtrace up to driver entry point at kernel launch time\n",
"========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n",
"========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n",