Repository: ContinuumIO/gtc2017-numba Branch: master Commit: 6ddaeec9baec Files: 15 Total size: 247.5 KB Directory structure: gitextract_ini3svql/ ├── 1 - Numba Basics.ipynb ├── 2 - CUDA Basics.ipynb ├── 3 - Memory Management.ipynb ├── 4 - Writing CUDA Kernels.ipynb ├── 5 - Troubleshooting and Debugging.ipynb ├── 6 - Extra Topics.ipynb ├── README.md ├── debug/ │ ├── ex1.py │ ├── ex1a.py │ ├── ex2.py │ ├── ex3.py │ └── ex3a.py └── docker/ ├── README.md ├── base/ │ └── Dockerfile └── notebooks/ └── Dockerfile ================================================ FILE CONTENTS ================================================ ================================================ FILE: 1 - Numba Basics.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GTC 2017 Numba Tutorial Notebook 1: Numba Basics\n", "\n", "## What is Numba?\n", "\n", "Numba is a **just-in-time**, **type-specializing**, **function compiler** for accelerating **numerically-focused** Python. That's a long list, so let's break down those terms:\n", "\n", " * **function compiler**: Numba compiles Python functions, not entire applications, and not parts of functions. Numba does not replace your Python interpreter, but is just another Python module that can turn a function into a (usually) faster function. \n", " * **type-specializing**: Numba speeds up your function by generating a specialized implementation for the specific data types you are using. Python functions are designed to operate on generic data types, which makes them very flexible, but also very slow. In practice, you will only call a function with a small number of argument types, so Numba will generate a fast implementation for each set of types.\n", " * **just-in-time**: Numba translates functions when they are first called. This ensures the compiler knows what argument types you will be using. 
This also allows Numba to be used interactively in a Jupyter notebook just as easily as in a traditional application.\n", " * **numerically-focused**: Currently, Numba is focused on numerical data types, like `int`, `float`, and `complex`. There is very limited string processing support, and many string use cases are not going to work well on the GPU. To get the best results with Numba, you will likely be using NumPy arrays.\n", "\n", "## Requirements\n", "\n", "Numba supports a wide range of operating systems:\n", "\n", " * Windows 7 and later, 32-bit and 64-bit\n", " * macOS 10.9 and later, 64-bit\n", " * Linux (almost anything >= RHEL 5), 32-bit and 64-bit\n", "\n", "and Python and NumPy versions:\n", "\n", " * Python 2.7, 3.3-3.6\n", " * NumPy 1.8 and later\n", "\n", "and a very wide range of hardware:\n", "\n", "* x86, x86_64/AMD64 CPUs\n", "* NVIDIA CUDA GPUs (compute capability 3.0 and later, CUDA 7.5 and later)\n", "* AMD GPUs (experimental patches)\n", "* ARM (experimental patches)\n", "\n", "For this tutorial, we will be using 64-bit Linux and CUDA 8." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## First Steps\n", "\n", "Let's write our first Numba function and compile it for the **CPU**. The Numba compiler is typically enabled by applying a *decorator* to a Python function. Decorators are functions that transform Python functions. 
Here we will use the CPU compilation decorator:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from numba import jit\n", "import math\n", "\n", "@jit\n", "def hypot(x, y):\n", " # Implementation from https://en.wikipedia.org/wiki/Hypot\n", " x = abs(x);\n", " y = abs(y);\n", " t = min(x, y);\n", " x = max(x, y);\n", " t = t / x;\n", " return x * math.sqrt(1+t*t)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The above code is equivalent to writing:\n", "``` python\n", "def hypot(x, y):\n", " x = abs(x);\n", " y = abs(y);\n", " t = min(x, y);\n", " x = max(x, y);\n", " t = t / x;\n", " return x * math.sqrt(1+t*t)\n", " \n", "hypot = jit(hypot)\n", "```\n", "This means that the Numba compiler is just a function you can call whenever you want!\n", "\n", "Let's try out our hypotenuse calculation:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.0" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hypot(3.0, 4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The first time we call `hypot`, the compiler is triggered and compiles a machine code implementation for float inputs. Numba also saves the original Python implementation of the function in the `.py_func` attribute, so we can call the original Python code to make sure we get the same answer:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "5.0" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "hypot.py_func(3.0, 4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Benchmarking\n", "\n", "An important part of using Numba is measuring the performance of your new code. Let's see if we actually sped anything up. The easiest way to do this in the Jupyter notebook is to use the `%timeit` magic function. 
Let's first measure the speed of the original Python:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 6.56 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "1000000 loops, best of 3: 893 ns per loop\n" ] } ], "source": [ "%timeit hypot.py_func(3.0, 4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `%timeit` magic runs the statement many times to get an accurate estimate of the run time. It also returns the best time by default, which is useful to reduce the probability that random background events affect your measurement. The best of 3 approach also ensures that the compilation time on the first call doesn't skew the results:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 24.98 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 176 ns per loop\n" ] } ], "source": [ "%timeit hypot(3.0, 4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Numba did a pretty good job with this function. It's about 5x faster than the pure Python version.\n", "\n", "Of course, the `hypot` function is already present in Python's built-in `math` module:" ] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 72.16 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "10000000 loops, best of 3: 138 ns per loop\n" ] } ], "source": [ "%timeit math.hypot(3.0, 4.0)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Python's built-in is even faster than Numba! 
This is because Numba does introduce some overhead to each function call that is larger than the function call overhead of Python itself. Extremely fast functions (like the above one) will be hurt by this.\n", "\n", "(However, if you call one Numba function from another one, there is very little function overhead, sometimes even zero if the compiler inlines the function into the other one.)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## How does Numba work?\n", "\n", "The first time we called our Numba-wrapped `hypot` function, the following process was initiated:\n", "\n", "![Numba Flowchart](img/numba_flowchart.png \"The compilation process\")\n", "\n", "We can see the result of type inference by using the `.inspect_types()` method, which prints an annotated version of the source code:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "hypot (float64, float64)\n", "--------------------------------------------------------------------------------\n", "# File: \n", "# --- LINE 4 --- \n", "# label 0\n", "# del x\n", "# del $0.1\n", "# del $0.3\n", "# del y\n", "# del $0.4\n", "# del $0.6\n", "# del $0.7\n", "# del $0.10\n", "# del y.1\n", "# del x.1\n", "# del $0.11\n", "# del $0.14\n", "# del t\n", "# del $0.17\n", "# del $0.19\n", "# del t.1\n", "# del $const0.21\n", "# del $0.24\n", "# del $0.25\n", "# del $0.20\n", "# del x.2\n", "# del $0.26\n", "# del $0.27\n", "\n", "@jit\n", "\n", "# --- LINE 5 --- \n", "\n", "def hypot(x, y):\n", "\n", " # --- LINE 6 --- \n", "\n", " # Implementation from https://en.wikipedia.org/wiki/Hypot\n", "\n", " # --- LINE 7 --- \n", " # x = arg(0, name=x) :: float64\n", " # y = arg(1, name=y) :: float64\n", " # $0.1 = global(abs: ) :: Function()\n", " # $0.3 = call $0.1(x) :: (float64,) -> float64\n", " # x.1 = $0.3 :: float64\n", "\n", " x = abs(x);\n", "\n", " # --- LINE 8 --- \n", " # $0.4 = global(abs: ) :: 
Function()\n", " # $0.6 = call $0.4(y) :: (float64,) -> float64\n", " # y.1 = $0.6 :: float64\n", "\n", " y = abs(y);\n", "\n", " # --- LINE 9 --- \n", " # $0.7 = global(min: ) :: Function()\n", " # $0.10 = call $0.7(x.1, y.1) :: (float64, float64) -> float64\n", " # t = $0.10 :: float64\n", "\n", " t = min(x, y);\n", "\n", " # --- LINE 10 --- \n", " # $0.11 = global(max: ) :: Function()\n", " # $0.14 = call $0.11(x.1, y.1) :: (float64, float64) -> float64\n", " # x.2 = $0.14 :: float64\n", "\n", " x = max(x, y);\n", "\n", " # --- LINE 11 --- \n", " # $0.17 = t / x.2 :: float64\n", " # t.1 = $0.17 :: float64\n", "\n", " t = t / x;\n", "\n", " # --- LINE 12 --- \n", " # $0.19 = global(math: ) :: Module()\n", " # $0.20 = getattr(value=$0.19, attr=sqrt) :: Function()\n", " # $const0.21 = const(int, 1) :: int64\n", " # $0.24 = t.1 * t.1 :: float64\n", " # $0.25 = $const0.21 + $0.24 :: float64\n", " # $0.26 = call $0.20($0.25) :: (float64,) -> float64\n", " # $0.27 = x.2 * $0.26 :: float64\n", " # $0.28 = cast(value=$0.27) :: float64\n", " # return $0.28\n", "\n", " return x * math.sqrt(1+t*t)\n", "\n", "\n", "================================================================================\n" ] } ], "source": [ "hypot.inspect_types()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that Numba's type names tend to mirror the NumPy type names, so a Python `float` is a `float64` (also called \"double precision\" in other languages). Taking a look at the data types can sometimes be important in GPU code because the performance of `float32` and `float64` computations will be very different on CUDA devices. An accidental upcast can dramatically slow down a function." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## When Things Go Wrong\n", "\n", "Numba cannot compile all Python code. Some functions don't have a Numba-translation, and some kinds of Python types can't be efficiently compiled at all (yet). 
For example, Numba does not support dictionaries (as of this tutorial):" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'value'" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "@jit\n", "def cannot_compile(x):\n", " return x['key']\n", "\n", "cannot_compile(dict(key='value'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wait, what happened?? By default, Numba will fall back to a mode, called \"object mode,\" which does not do type-specialization. Object mode exists to enable other Numba functionality, but in many cases, you want Numba to tell you if type inference fails. You can force \"nopython mode\" (the other compilation mode) by passing arguments to the decorator:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "ename": "TypingError", "evalue": "Failed at nopython (nopython frontend)\nInvalid usage of getitem with parameters (pyobject, const('key'))\n * parameterized\nFile \"\", line 3\n[1] During: typing of intrinsic-call at (3)\n[2] During: typing of static-get-item at (3)\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of \n", "output_type": "error", "traceback": [ "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m", "\u001b[0;31mTypingError\u001b[0m Traceback (most recent call last)", "\u001b[0;32m\u001b[0m in \u001b[0;36m\u001b[0;34m()\u001b[0m\n\u001b[1;32m 3\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0mx\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;34m'key'\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 4\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m----> 5\u001b[0;31m 
\u001b[0mcannot_compile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mdict\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mkey\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m'value'\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m 328\u001b[0m for i, err in failed_args))\n\u001b[1;32m 329\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpatch_message\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 330\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 331\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 332\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0minspect_llvm\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msignature\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36m_compile_for_args\u001b[0;34m(self, *args, **kws)\u001b[0m\n\u001b[1;32m 305\u001b[0m \u001b[0margtypes\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtypeof_pyval\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0ma\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 306\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 307\u001b[0;31m \u001b[0;32mreturn\u001b[0m 
\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtuple\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margtypes\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 308\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mTypingError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 309\u001b[0m \u001b[0;31m# Intercept typing error that may be due to an argument\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36mcompile\u001b[0;34m(self, sig)\u001b[0m\n\u001b[1;32m 576\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 577\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_cache_misses\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0msig\u001b[0m\u001b[0;34m]\u001b[0m \u001b[0;34m+=\u001b[0m \u001b[0;36m1\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 578\u001b[0;31m \u001b[0mcres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_compiler\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturn_type\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 579\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_overload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mcres\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 580\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_cache\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0msave_overload\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0msig\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcres\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/dispatcher.py\u001b[0m in \u001b[0;36mcompile\u001b[0;34m(self, args, return_type)\u001b[0m\n\u001b[1;32m 78\u001b[0m \u001b[0mimpl\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 79\u001b[0m \u001b[0margs\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mreturn_type\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mreturn_type\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m---> 80\u001b[0;31m flags=flags, locals=self.locals)\n\u001b[0m\u001b[1;32m 81\u001b[0m \u001b[0;31m# Check typing error if object mode is used\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 82\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mcres\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtyping_error\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m \u001b[0;32mand\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0mflags\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0menable_pyobject\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mcompile_extra\u001b[0;34m(typingctx, targetctx, func, args, return_type, flags, locals, library)\u001b[0m\n\u001b[1;32m 702\u001b[0m pipeline = Pipeline(typingctx, targetctx, library,\n\u001b[1;32m 703\u001b[0m args, return_type, flags, locals)\n\u001b[0;32m--> 704\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mpipeline\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mcompile_extra\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 705\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 706\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in 
\u001b[0;36mcompile_extra\u001b[0;34m(self, func)\u001b[0m\n\u001b[1;32m 355\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlifted\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 356\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mlifted_from\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 357\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_compile_bytecode\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 358\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 359\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mcompile_ir\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfunc_ir\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlifted\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mlifted_from\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0;32mNone\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36m_compile_bytecode\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 663\u001b[0m \"\"\"\n\u001b[1;32m 664\u001b[0m \u001b[0;32massert\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfunc_ir\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 665\u001b[0;31m \u001b[0;32mreturn\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0m_compile_core\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 666\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 667\u001b[0m \u001b[0;32mdef\u001b[0m 
\u001b[0m_compile_ir\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36m_compile_core\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 650\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 651\u001b[0m \u001b[0mpm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfinalize\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 652\u001b[0;31m \u001b[0mres\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mpm\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mrun\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mstatus\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 653\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mres\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 654\u001b[0m \u001b[0;31m# Early pipeline completion\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, status)\u001b[0m\n\u001b[1;32m 241\u001b[0m \u001b[0;31m# No more fallback pipelines?\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 242\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mis_final_pipeline\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 243\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0mpatched_exception\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 244\u001b[0m \u001b[0;31m# Go to next fallback pipeline\u001b[0m\u001b[0;34m\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 245\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", 
"\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mrun\u001b[0;34m(self, status)\u001b[0m\n\u001b[1;32m 233\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 234\u001b[0m \u001b[0mevent\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mstage_name\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 235\u001b[0;31m \u001b[0mstage\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 236\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0m_EarlyPipelineCompletion\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 237\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresult\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mstage_nopython_frontend\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m 447\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0margs\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 448\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreturn_type\u001b[0m\u001b[0;34m,\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 449\u001b[0;31m self.locals)\n\u001b[0m\u001b[1;32m 450\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 451\u001b[0m with self.fallback_context('Function \"%s\" has invalid return type'\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/compiler.py\u001b[0m in \u001b[0;36mtype_inference_stage\u001b[0;34m(typingctx, interp, args, return_type, locals)\u001b[0m\n\u001b[1;32m 803\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 804\u001b[0m 
\u001b[0minfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mbuild_constraint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 805\u001b[0;31m \u001b[0minfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mpropagate\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 806\u001b[0m \u001b[0mtypemap\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mrestype\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mcalltypes\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0minfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0munify\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 807\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36mpropagate\u001b[0;34m(self, raise_errors)\u001b[0m\n\u001b[1;32m 765\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 766\u001b[0m \u001b[0;32mif\u001b[0m \u001b[0mraise_errors\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 767\u001b[0;31m \u001b[0;32mraise\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0;36m0\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 768\u001b[0m \u001b[0;32melse\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 769\u001b[0m \u001b[0;32mreturn\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36mpropagate\u001b[0;34m(self, typeinfer)\u001b[0m\n\u001b[1;32m 126\u001b[0m lineno=loc.line):\n\u001b[1;32m 127\u001b[0m \u001b[0;32mtry\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 128\u001b[0;31m 
\u001b[0mconstraint\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtypeinfer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 129\u001b[0m \u001b[0;32mexcept\u001b[0m \u001b[0mTypingError\u001b[0m \u001b[0;32mas\u001b[0m \u001b[0me\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 130\u001b[0m \u001b[0merrors\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mappend\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0me\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, typeinfer)\u001b[0m\n\u001b[1;32m 322\u001b[0m \u001b[0mtypeinfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mitemty\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mloc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 323\u001b[0m \u001b[0;32melif\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfallback\u001b[0m \u001b[0;32mis\u001b[0m \u001b[0;32mnot\u001b[0m \u001b[0;32mNone\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 324\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfallback\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtypeinfer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 325\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 326\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0mget_call_signature\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36m__call__\u001b[0;34m(self, 
typeinfer)\u001b[0m\n\u001b[1;32m 435\u001b[0m \u001b[0;32mdef\u001b[0m \u001b[0m__call__\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtypeinfer\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 436\u001b[0m \u001b[0;32mwith\u001b[0m \u001b[0mnew_error_context\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m\"typing of intrinsic-call at {0}\"\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m:\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 437\u001b[0;31m \u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mresolve\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mtypeinfer\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mtypeinfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtypevars\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mfnty\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mfunc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 438\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 439\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;32m/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/typeinfer.py\u001b[0m in \u001b[0;36mresolve\u001b[0;34m(self, typeinfer, typevars, fnty)\u001b[0m\n\u001b[1;32m 399\u001b[0m \u001b[0mdesc\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0mcontext\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mexplain_function_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mfnty\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 400\u001b[0m \u001b[0mmsg\u001b[0m \u001b[0;34m=\u001b[0m \u001b[0;34m'\\n'\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mjoin\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0;34m[\u001b[0m\u001b[0mhead\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mdesc\u001b[0m\u001b[0;34m]\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0;32m--> 401\u001b[0;31m \u001b[0;32mraise\u001b[0m 
\u001b[0mTypingError\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mmsg\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mloc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n\u001b[0m\u001b[1;32m 402\u001b[0m \u001b[0;34m\u001b[0m\u001b[0m\n\u001b[1;32m 403\u001b[0m \u001b[0mtypeinfer\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0madd_type\u001b[0m\u001b[0;34m(\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mtarget\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0msig\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mreturn_type\u001b[0m\u001b[0;34m,\u001b[0m \u001b[0mloc\u001b[0m\u001b[0;34m=\u001b[0m\u001b[0mself\u001b[0m\u001b[0;34m.\u001b[0m\u001b[0mloc\u001b[0m\u001b[0;34m)\u001b[0m\u001b[0;34m\u001b[0m\u001b[0m\n", "\u001b[0;31mTypingError\u001b[0m: Failed at nopython (nopython frontend)\nInvalid usage of getitem with parameters (pyobject, const('key'))\n * parameterized\nFile \"\", line 3\n[1] During: typing of intrinsic-call at (3)\n[2] During: typing of static-get-item at (3)\n\nThis error may have been caused by the following argument(s):\n- argument 0: cannot determine Numba type of \n" ] } ], "source": [ "@jit(nopython=True)\n", "def cannot_compile(x):\n", " return x['key']\n", "\n", "cannot_compile(dict(key='value'))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we get an exception when Numba tries to compile the function, with an error that says:\n", "```\n", "- argument 0: cannot determine Numba type of \n", "```\n", "which is the underlying problem.\n", "\n", "We will see other `@jit` decorator arguments in future sections." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise\n", "\n", "Below is a function that loops over two input NumPy arrays and puts their sum into the output array. Modify this function to call the `hypot` function we defined above. 
We will learn a more efficient way to write such functions in a future section.\n", "\n", "(Make sure to execute all the cells in this notebook so that `hypot` is defined.)" ] }, { "cell_type": "code", "execution_count": 37, "metadata": { "collapsed": true }, "outputs": [], "source": [ "@jit(nopython=True)\n", "def ex1(x, y, out):\n", " for i in range(x.shape[0]):\n", " out[i] = x[i] + y[i]" ] }, { "cell_type": "code", "execution_count": 38, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "in1: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]\n", "in2: [ 1. 3. 5. 7. 9. 11. 13. 15. 17. 19.]\n", "out: [ 1. 4. 7. 10. 13. 16. 19. 22. 25. 28.]\n" ] } ], "source": [ "import numpy as np\n", "\n", "in1 = np.arange(10, dtype=np.float64)\n", "in2 = 2 * in1 + 1\n", "out = np.empty_like(in1)\n", "\n", "print('in1:', in1)\n", "print('in2:', in2)\n", "\n", "ex1(in1, in2, out)\n", "\n", "print('out:', out)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# This test will fail until you fix the ex1 function\n", "np.testing.assert_almost_equal(out, np.hypot(in1, in2))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: 2 - CUDA Basics.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GTC 2017 Numba Tutorial Notebook 2: CUDA Basics\n", "\n", "There are two basic approaches to GPU programming in Numba:\n", "\n", " 1. 
ufuncs/gufuncs (subject of this section)\n", " 2. CUDA Python kernels (subject of next section)\n", " \n", "We will not go into the CUDA hardware too much in this tutorial, but the most important thing to remember is that the hardware is designed for *data parallelism*. Maximum throughput is achieved when you are computing the same operations on many different elements at once. \n", "\n", "Universal functions are naturally data parallel, so we will begin with them." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Universal Functions\n", "\n", "NumPy has the concept of universal functions (\"ufuncs\"), which are functions that can take NumPy arrays of varying dimensions (or scalars) and operate on them element-by-element.\n", "\n", "It is probably easiest to show what this means by example. We'll use the NumPy `add` ufunc to demonstrate:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([11, 22, 33, 44])" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import numpy as np\n", "\n", "a = np.array([1, 2, 3, 4])\n", "b = np.array([10, 20, 30, 40])\n", "\n", "np.add(a, b)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ufuncs can also combine scalars with arrays:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([101, 102, 103, 104])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.add(a, 100)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Arrays of different but compatible dimensions can also be combined. 
The lower-dimensional array will be broadcast (virtually replicated) to match the dimensionality of the higher-dimensional array." ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "c: [[ 0 1 2 3]\n", " [ 4 5 6 7]\n", " [ 8 9 10 11]\n", " [12 13 14 15]]\n" ] }, { "data": { "text/plain": [ "array([[10, 21, 32, 43],\n", " [14, 25, 36, 47],\n", " [18, 29, 40, 51],\n", " [22, 33, 44, 55]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c = np.arange(4*4).reshape((4,4))\n", "print('c:', c)\n", "\n", "np.add(b, c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In the above situation, the `b` array is added to each row of `c`. If we want to add `b` to each column, we need to transpose it. There are several ways to do this, but one way is to insert a new axis using `np.newaxis`:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[10],\n", " [20],\n", " [30],\n", " [40]])" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" } ], "source": [ "b_col = b[:, np.newaxis]\n", "b_col" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([[10, 11, 12, 13],\n", " [24, 25, 26, 27],\n", " [38, 39, 40, 41],\n", " [52, 53, 54, 55]])" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "np.add(b_col, c)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The NumPy documentation has a much more extensive discussion of ufuncs:\n", "\n", "https://docs.scipy.org/doc/numpy/reference/ufuncs.html" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Making ufuncs for the GPU\n", "\n", "Numba has the ability to create compiled ufuncs. You implement a scalar function of all the inputs, and Numba will figure out the broadcast rules for you. 
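To make the pattern concrete before bringing in CUDA, here is a sketch using NumPy's own `np.vectorize` (an uncompiled stand-in for this illustration: it applies a scalar function with the same broadcasting behavior, but interpreted in pure Python rather than compiled by Numba):

```python
import numpy as np

# Sketch: np.vectorize is NumPy's slow, uncompiled analogue of
# numba.vectorize -- you write a scalar function, and broadcasting
# across the array arguments is handled for you.
def scalar_add(x, y):
    return x + y

vec_add = np.vectorize(scalar_add)

b = np.array([10, 20, 30, 40])
c = np.arange(4 * 4).reshape((4, 4))

vec_add(b[:, np.newaxis], c)  # the column vector is broadcast across the columns of c
```

The result matches `b_col + c` above; `numba.vectorize` compiles this same calling convention into fast machine code.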
Generating a ufunc that uses CUDA requires giving an explicit type signature and setting the `target` attribute:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numba import vectorize\n", "\n", "@vectorize(['int64(int64, int64)'], target='cuda')\n", "def add_ufunc(x, y):\n", " return x + y" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "a+b:\n", " [11 22 33 44]\n", "\n", "b_col + c:\n", " [[10 11 12 13]\n", " [24 25 26 27]\n", " [38 39 40 41]\n", " [52 53 54 55]]\n" ] } ], "source": [ "print('a+b:\\n', add_ufunc(a, b))\n", "print()\n", "print('b_col + c:\\n', add_ufunc(b_col, c))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "A lot of things just happened! Numba automatically:\n", "\n", " * Compiled a CUDA kernel to execute the ufunc operation in parallel over all the input elements.\n", " * Allocated GPU memory for the inputs and the output.\n", " * Copied the input data to the GPU.\n", " * Executed the CUDA kernel with the correct kernel dimensions given the input sizes.\n", " * Copied the result back from the GPU to the CPU.\n", " * Returned the result as a NumPy array on the host.\n", "\n", "This is very convenient for testing, but copying data back and forth between the CPU and GPU can be slow and hurt performance. In the next tutorial notebook, you'll learn about device management and memory allocation.\n", "\n", "You might be wondering how fast our simple example runs on the GPU. Let's see:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 15.95 times longer than the fastest. 
This could mean that an intermediate result is being cached.\n", "1000000 loops, best of 3: 1.29 µs per loop\n" ] } ], "source": [ "%timeit np.add(b_col, c) # NumPy on CPU" ] }, { "cell_type": "code", "execution_count": 33, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 loops, best of 3: 691 µs per loop\n" ] } ], "source": [ "%timeit add_ufunc(b_col, c) # Numba on GPU" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Wow, the GPU is *a lot slower* than the CPU?? This is to be expected because we have (deliberately) misused the GPU in several ways in this example:\n", "\n", " * **Our inputs are too small**: the GPU achieves performance through parallelism, operating on thousands of values at once. Our test inputs have only 4 and 16 integers, respectively. We need a much larger array to even keep the GPU busy.\n", " * **Our calculation is too simple**: Sending a calculation to the GPU involves quite a bit of overhead compared to calling a function on the CPU. If our calculation does not involve enough math operations (often called \"arithmetic intensity\"), then the GPU will spend most of its time waiting for data to move around.\n", " * **We copy the data to and from the GPU**: While including the copy time can be realistic for a single function, often we want to run several GPU operations in sequence. In those cases, it makes sense to send data to the GPU and keep it there until all of our processing is complete.\n", " * **Our data types are larger than necessary**: Our example uses `int64` when we probably don't need it. Scalar code using 32-bit and 64-bit data types runs at basically the same speed on the CPU, but 64-bit data types have a significant performance cost on the GPU. Basic arithmetic on 64-bit floats can be anywhere from 2x (Pascal-architecture Tesla) to 24x (Maxwell-architecture GeForce) slower than 32-bit floats. 
NumPy defaults to 64-bit data types when creating arrays, so it is important to set the `dtype` attribute or use the `ndarray.astype()` method to pick 32-bit types when you need them.\n", " \n", " \n", "Given the above, let's try an example that is faster on the GPU:" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import math # Note that for the CUDA target, we need to use the scalar functions from the math module, not NumPy\n", "\n", "SQRT_2PI = np.float32((2*math.pi)**0.5) # Precompute this constant as a float32. Numba will inline it at compile time.\n", "\n", "@vectorize(['float32(float32, float32, float32)'], target='cuda')\n", "def gaussian_pdf(x, mean, sigma):\n", " '''Compute the value of a Gaussian probability density function at x with given mean and sigma.'''\n", " return math.exp(-0.5 * ((x - mean) / sigma)**2) / (sigma * SQRT_2PI)" ] }, { "cell_type": "code", "execution_count": 47, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 0.38114083], dtype=float32)" ] }, "execution_count": 47, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Evaluate the Gaussian a million times!\n", "x = np.random.uniform(-3, 3, size=1000000).astype(np.float32)\n", "mean = np.float32(0.0)\n", "sigma = np.float32(1.0)\n", "\n", "# Quick test\n", "gaussian_pdf(x[0], 0.0, 1.0)" ] }, { "cell_type": "code", "execution_count": 48, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "10 loops, best of 3: 51.7 ms per loop\n" ] } ], "source": [ "import scipy.stats # for definition of gaussian distribution\n", "norm_pdf = scipy.stats.norm\n", "%timeit norm_pdf.pdf(x, loc=mean, scale=sigma)" ] }, { "cell_type": "code", "execution_count": 49, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 15.15 times longer than the fastest. 
This could mean that an intermediate result is being cached.\n", "10 loops, best of 3: 7.69 ms per loop\n" ] } ], "source": [ "%timeit gaussian_pdf(x, mean, sigma)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a pretty large improvement, even including the overhead of copying all the data to and from the GPU. Ufuncs that use special functions (`exp`, `sin`, `cos`, etc) on large data sets run especially well on the GPU." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## CUDA Device Functions\n", "\n", "Ufuncs are great, but you should not have to cram all of your logic into a single function body. You can also create normal functions that are only called from other functions running on the GPU. (These are similar to CUDA C functions defined with `__device__`.)\n", "\n", "Device functions are created with the `numba.cuda.jit` decorator:" ] }, { "cell_type": "code", "execution_count": 54, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numba import cuda\n", "\n", "@cuda.jit(device=True)\n", "def polar_to_cartesian(rho, theta):\n", " x = rho * math.cos(theta)\n", " y = rho * math.sin(theta)\n", " return x, y # This is Python, so let's return a tuple\n", "\n", "@vectorize(['float32(float32, float32, float32, float32)'], target='cuda')\n", "def polar_distance(rho1, theta1, rho2, theta2):\n", " x1, y1 = polar_to_cartesian(rho1, theta1)\n", " x2, y2 = polar_to_cartesian(rho2, theta2)\n", " \n", " return ((x1 - x2)**2 + (y1 - y2)**2)**0.5" ] }, { "cell_type": "code", "execution_count": 55, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n = 1000000\n", "rho1 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)\n", "theta1 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)\n", "rho2 = np.random.uniform(0.5, 1.5, size=n).astype(np.float32)\n", "theta2 = np.random.uniform(-np.pi, np.pi, size=n).astype(np.float32)" ] }, { "cell_type": "code", "execution_count": 58, "metadata": {}, "outputs": 
[ { "data": { "text/plain": [ "array([ 2.30478978, 0.23699605, 1.02151287, ..., 1.00132406,\n", " 0.45568103, 0.3965269 ], dtype=float32)" ] }, "execution_count": 58, "metadata": {}, "output_type": "execute_result" } ], "source": [ "polar_distance(rho1, theta1, rho2, theta2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Note that the CUDA compiler aggressively inlines device functions, so there is generally no overhead for function calls. Similarly, the \"tuple\" returned by `polar_to_cartesian` is not actually created as a Python object, but represented temporarily as a struct, which is then optimized away by the compiler." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Allowed Python on the GPU\n", "\n", "Compared to Numba on the CPU (which is already limited), Numba on the GPU has more limitations. Supported Python includes:\n", "\n", "* `if`/`elif`/`else`\n", "* `while` and `for` loops\n", "* Basic math operators\n", "* Selected functions from the `math` and `cmath` modules\n", "* Tuples\n", "\n", "See [the Numba manual](http://numba.pydata.org/numba-doc/latest/cuda/cudapysupported.html) for more details." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise\n", "\n", "Let's build a \"zero suppression\" function. A common operation when working with waveforms is to force all sample values below a certain absolute magnitude to be zero, as a way to eliminate low-amplitude noise. 
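For reference, here is what zero suppression looks like as a whole-array NumPy expression (a sketch with made-up sample values; the exercise below asks you to express the same per-sample logic in a ufunc body):

```python
import numpy as np

# Sketch of zero suppression on an entire array at once:
# samples whose absolute value falls below the threshold become zero,
# everything else passes through unchanged.
def zero_suppress_numpy(waveform, threshold):
    return np.where(np.abs(waveform) < threshold, 0, waveform)

w = np.array([-40, -3, 2, 18, 0, -25], dtype=np.int16)
zero_suppress_numpy(w, 15)  # -> [-40, 0, 0, 18, 0, -25]
```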
Let's make some sample data:" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 25, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAXsAAAD8CAYAAACW/ATfAAAABHNCSVQICAgIfAhkiAAAAAlwSFlz\nAAALEgAACxIB0t1+/AAAIABJREFUeJzt3Xl8FdXZB/DfQxYIIWENEAIkLGEVAUkRlCKbbFqxrVro\nq2KrVatdbKsFtGpb17q0r0tt1bpVraKI1RcVZHFBVCCRLQlLIgmQELIBgQSyn/ePTO6dm9ybu83M\nuXPm+X4++TB37tw7z1xunsycOec5JIQAY4wxtXWSHQBjjDHzcbJnjDEH4GTPGGMOwMmeMcYcgJM9\nY4w5ACd7xhhzAE72jDHmAJzsGWPMATjZM8aYA0TLDgAA+vTpI9LS0mSHwRhjtpKVlVUhhEgKZNuI\nSPZpaWnIzMyUHQZjjNkKER0KdFtuxmGMMQfgZM8YYw7AyZ4xxhyAkz1jjDkAJ3vGGHMATvaMMeYA\nnOwZY8wBONl7UXTiDD7dXyY7DMZcsoursOvISdlhKEEIgbczj6CusUl2KJbiZN/G61sPYdpfPsF1\nL22XHQpjLpc+9QUW/X2L7DCUsC6nFHes2o3/eX6r7FAsxcle52x9E+56N1t2GIz5lLb8A+QcrZId\nhq2dqm0AAGQeOoFtBcclR2MdTvY6m/PKZYeglLTlH7h+nHbJbKaf8FVnyJ7YkIffr9rtenzVs19J\njMZanOx1Hvxwr8fj4pNnJUVif8t0v1AAUFldLykSezt5ph5pyz/wWFd2uk5SNPb3tw0HZIcgDSd7\nncLKMx6PL3x4ExqbmiVFY28rM494PL7g4U2obeCz+2AdPVnrdX1rUwQL37Eq75+xajjZa97bWex1\n/b3v51gcif35Sup3rt5jcST2tzbnmNf1/93h/fvKgld2mpO9Y5yubcCv39zp9bnXtx62OBr7G3X3\nWq/rV3OCCtqTG/O8rr/nPT4JCZavq/TLnnZGLydO9gA+yvZ+9sSCV1HN7cksMj37+UGfz9XUNVoY\niRyc7AGPu/MsPKu/KerweSGERZEw5unRdft9PvfYx76fUwUn+wAUnTjjfyMGAMgrre7w+fvW7O3w\neRa4VVkd/2FlbtsLO+5P/9KWQmsCkYiTfQCyDp2QHYJtvO0nAb24pcCiSOzvHT+f5e1v77IoEvvj\nq3dO9gHxdfOWMTP9jpO5YQoqamSHIB0ne2YY7kdvHP4srXeiRu2Bf45P9tUOuAtvldXfcNdKo7zj\n50Z3q4/2lJgciXNc/YLahdH8JnsiGkREnxBRLhHlENGvtfW9iGg9EeVp//bUvWYFEeUT0X4immfm\nAYQr08+Nm1ZNzdyLxJ9N+7gstFEC7bT089e/MTcQB8k5ekp2CKYK5My+EcDvhBBjAEwBcCsRjQGw\nHMBGIUQ6gI3aY2jPLQYwFsB8AM8QUZQZwRsh0FLGw+780ORI7G/D3tKAtttbovYvlRG4Gcc4Tqps\n2RG/yV4IUSKE+EZbPg1gL4AUAIsAvKJt9gqAy7XlRQDeFELUCSEKAOQDmGx04My+Fjyxmfvb+3H/\nB9xF1ShOqmzZkaDa7IkoDcBEAFsB9BNCtDYYHgPQT1tOAaCvglWkrWv7XjcSUSYRZZaXyykt3Oyj\naeZ3F4+wOBJ1TU7r5XX9O9y+zyT6/I6ZXtdvPVhpcSTWCTjZE1E3AO8AuE0I4XEdLlpO04I6VRNC\nPCeEyBBCZCQlJQXzUsO8uf2I1/W/nJ2Or1fM
tjgaNb1181SsvHFKu/VcViF4REBa767t1vs6aWG+\nDe7dFfddfk679VVn1a0mGlCyJ6IYtCT614UQq7XVpUSUrD2fDKD17lwxgEG6lw/U1kWcgor2oz3v\nXDgKABDVido9d7ae21GDceWkgQCAmOj2XzNuxQnOTy5Mw8EHF2L+OcntnlsVYM8d1uKzO2YAcH8/\n9VT+uxlIbxwC8AKAvUKIv+qeeh/AUm15KYD3dOsXE1FnIhoCIB3ANuNCNs7zm9uP5rxx+jAAQFJC\n53bPPbXJewVC5v1M/Y+XjQUATBzUo91zj67bZ3pMdlXf2L46473fG4uWX8X2eHSob97uDaX2jgcA\ndImJwtCkeI/n1ucG1snAjgI5s78QwDUAZhHRTu1nIYCHAVxMRHkA5miPIYTIAfAWgFwAawHcKoRQ\n4pT4mU+/lR1CxPrb+vYzAMV3jgYAEBF23H2xx3Mqn0GF60y977Efo5MTLIzE/h5e63lSkd63m8fj\nt26a6vE40PENdhTtbwMhxBcAvJ9SAF4btoUQDwB4IIy4pBiX0l12CLbVtu7/nj/O9XjcPS7GynBs\nraP+3osmpKBH11gsfTEiL5YjzjtZni3Ib7a5f5TYxTnfS8ePoNXrl+jZdPPE4gmSIrG/hDa/RJ28\n3ANh3v3PvzxHct5z6RiPxxeNkNOhwY7aTt/Yu5vn73isl/tJqnLOkbbhrQfD7fNGejxeNKFdj1HG\nLPfTaUNkh2Bb3u5/OJVjk33Zac8bij+YmIJR/RMlRcNYcNqekR45znMusI45Ntm3bVUY2d/7ja+U\nHnEWRKOWF5ZmeF1/+YQBFkeirivadBtc52NicuZfcvcuHo8XPf2FpEjM5dhk39imGWfsAO83Z6+7\nIM2CaOytbZPYuIHeP8vHr/K8B8Jno+0FOivanNF927zurBnhKGXe2H5e16//7UUej3cVVVkRjuUc\nm+xvW+k5Icm09D5etzt/qPfh/sztdK1nV8G+CV28btd2oFpe2WnTYrKr0lOezYt7/zzf63azRnkm\nrl1FJ02LSRUj+3m/eu/W2W+nRCU4NtkHWgnvnDZn/FzAq73xf/44pNfVNvDNs7bazq8QFxtYwdgd\nhznZ++NtoKSTODbZB6rtoMXdil7iyfDs5wdlhxBxuP+8cfJKPa8cLxvvu3fdKB/37FTCyR7AL2cN\n9/lc2yHqfCPMOLuO8NloOBaO6y87hIh206tZHo+7d/U9gGpQL88Ccw1N6l11crIH8L3xgfcS4ZIJ\nHfv5jGEdPj92AHdvNcqKBaNlhxDRDgYxyfhtc9I9HlfXqjddqSOTfWWbol0jfNy4Yf6dbjNCcfao\nvj62bHHJue2rNjLvYqI6HnXc9myUhc5XbzyVODLZnwryr3ZcTMTOqihdsGWfuStr4OaODa6ZpvRU\nrUmROI+K3TAcmewbg2yPG9LHswwq98jRCbLkTddYZ3RzM8K5QRbm27iXJ3z35Vcd3Jfz5ov8CpMi\nkceRyf7JTflBbd+2IFr5aZ5lqRW1yfbjvdSu70gT1zr2afHkwUFtz5OU+/ar2en+N9Jv/8YOkyKR\nx5HJfn1ucD1q0tu06XN6cms7jVtMVHBfKZWngQtXsGWh+XvpW3SQ30sVOfITCHcwz9GTPDS9VUEQ\nPR684a6sbsE2L7b1+YFygyJxpljF/yCofXQB+OtV44N+zWGu6eLys39nBv2aacPdpSlWrN5jZDi2\n9tNXgv8sr8pwF0T7jJO9Syj31b5cMcuESCKH45N9KF0BeeJx7/7+4/MC2u4cnhHMq1DOzO/53lgT\nIrG/tjOnBaJPN7XLKTg+2XeODr5b5XI+G/Xq4jHeqwq2de3UVJMjsb8Xr/NeJrotpxTxCtZDH+6V\nHULEcXyyZ8bxNwio1QCeI8CvSYO52mo4avjqux1O9gHStzMz79rWEWKh66iOS0e8TbfpdDdf1HEJ\nD6fgZB+g
B78/TnYIjPl1jEfRtjMsKd7/Rl6E2zsq0jgu2a/PLQ3pdZ1jHPdRMRv6YHeJ7BCkO1Tp\n2R34ohFJAb/2u7pJjAor1ep157gMpu8q2Ds+NuDXBTvAxQmOVRlzFsnlJ4wT6LSGKjt5xnOgXt9E\n7zOneXPj9KGu5WLFxtM4Ltnr+Zpk3JsuXAytnXd3FBvyPntLeHpCozRwmz06hXHvKF7Xu0m1iWQc\nnexvnRlccSQ9rkMC/GXtPtdyOFc+73xTZEQ4tnaqNvSyEcsXjHItf7afB1aF009gwsDgajvZiaOT\nfTh9lFW7xAvXU0smBrV9f92lNRdDA/aFcXWzRFcwjb+X4enUSd0eZY5O9uFMQLy7iKfU09Pf2ArE\nv6+f7FpubFar10MoToVREI7vJ3mqqVNvlikjODrZBzu45/M7ZrqWf7Nyl9Hh2Fqwfez1s4O99nXw\nQ9tVc4Ou48B/fna+xEjs7w//zZYdQkRydLIPVlSAI0QZC8d5g3vKDsHW8sqqXctXTwluTgCVcbIP\nQv8gunAxxuS7ftpQ/xs5BCf7IEQpfPOGRQ4edmCctN7hTcpe36jO/SRHJXsevMPsQPCcU4YJt15T\nOF1iI43fZE9ELxJRGRFl69b9kYiKiWin9rNQ99wKIsonov1ENM+swEPBPfyYHYR7TpJfxoPUwnHX\nwtGu5WaFThADObN/GcB8L+v/JoSYoP18CABENAbAYgBjtdc8Q0QRM/T0P1sPuZYfueJciZEw5lu4\n6aXEoDIWTjV7dF/X8uYDFRIjMZbfZC+E+BzA8QDfbxGAN4UQdUKIAgD5ACb7eY1l7n4vx7U8OY3r\nhYfjeE29oe/HNV3CEx/rPqc6pFgBL6vF6Oai/ThXnTmSw2mz/yUR7daaeVr7iqUAOKLbpkhbF3H4\nZmt4Tp5xJ/tbZ4ZfL3zJ81+H/R521fYmoD5xB+pe3fSEG/eGVtmVtUjRjb9Zl6POZxlqsv8HgKEA\nJgAoAfB4sG9ARDcSUSYRZZaXW1/Po2sIv1BtObk+jn5Y/o/PD22aQf2kEkeOO3eYf0GFZ0neUG4q\npvR0J6g9xafCjsmujCi9oWrJhJCSvRCiVAjRJIRoBvA83E01xQAG6TYdqK3z9h7PCSEyhBAZSUmB\n15s2Sm8DJhf+KNu5tcOvecFdETChS2g1hq7MGGhUOLamL70xZ3Rg8/i2daFuJrWK6rqwY7IrlbpK\nGi2kZE9EybqH3wfQ2lPnfQCLiagzEQ0BkA5AqTqhc3Q3b77Iq5QYSeRI7BJabZYUnosWAHDHqt2u\n5Vmj+nawJfOHu636FkjXyzcAfAVgJBEVEdH1AB4hoj1EtBvATAC/AQAhRA6AtwDkAlgL4FYhhFJt\nHY9cMd61rFIfXBYZlkwe5H8j5lOjrhnnoR/wVKJ6fq+/hRBLvKx+oYPtHwDwQDhBRbJEXZNFqFMc\nMuYLT9oenhXv7HEtzx/b35D3FEIo8f/iqBG0RoiO4o/MKJ2j+bNkxvpgj/s+Wo+uxpR+VmW+Bf5t\nY9KocLbEIpdR3y9VvqeOTPZGdLtkjDlD6Sk1RiQ7Jtnru2T14Jl9wnKYR2hGPC76Zxx9fXw7c0yy\n/2R/mWtZ1UETVqkKYwo9Zp4+urEjnx3gicfD8evZ6a7lpS+q0XvcMcl+g67nzN9+NEFiJPZnZBPm\n7XNHuJbP1PPcoeG4ZYZ7RPLGvWUdbMn8UXHAn2OS/UHdkPThSd0Me18njtjrZGC2H9w73rWsSKcH\naXp3i3Utv511pIMtmT8qDvhzTLLPOnTCtRxt4FyylTXOG5reSfeteekn3wnrvXrHuxOU0ytfzhgZ\nXtmQy8YPcC3XNjjvJMRIqvTA0XNMstfr1jm0Wi6tZuuGtK/c7rwzqBc2F7iWZ44Mb3j/uQO7u5b3\nlji3gBcAxIf5vVQxQTHjODLZhytO13Xzq2+dVx/n7awiw95Ln6Be2lJo2P
vaxatfFbqWozhZMxNx\nsg/B9BHuy+2tBYHO68K8iYtx/+HcXVQlMRI59BPq/HTaEImR2J/+Bv+EQT0kRhKZHJnsw73c1beN\nsvDwJDJuo5MTZIdga/oeSM9dO0liJJHJkck+XF1ieAQuMx6B//CF45dv7HAt903oIjGSyMTJnrEI\nEcuF4ZiJ+NvFGGMO4Lhkf9P0obJDYIwxyzki2euLQl16Lt9cZYwF53hNvewQwuaIZH9IV6WRuzKH\np65RqVkmGfNJP/n795/ZIjESYzgi2e8qOula5mQf2Xhe3/A8dqV7juTGJi6ZEI4Hv3+Oa7mmzv4n\nOY5I9q9+dUh2CCxATqqVX3XG+D9sCbo5kvUnOSx4fRPd3Tcrqu1fA8sRyb5BV04xKaFzB1syf/Rt\nl3fMGykxEvurMKGInr7u0z260bmMOSLZ7zriPsPhwRbh0V8l3TpzuCHvqS8n+/VB59Qa0k8m9eSS\niYa8Z2IX9yxsOUedU1iOZ+byzxHJ3gz3XDrGtazKHJWBeObTbw1/z2umprqWX/6y0PD3j1T6Kp9z\nRodXPdTpDpSqMXWgmTjZh2jykF6u5eo6nmEpHEunprmWi06clReIxfRXnF1jwytv3GrMgERD3sdu\nGvhmtF+c7EOkv2p8aUuB7w2ZX/qS0U7yry+M/944tbBcje6ESz8XL3PjZB8iAXe2f+3rwxIjYYzp\nZ5/7esUsiZFELk72IerfnW/0MhYp9Cdc0VHmpDW7zzfNyT5E3KuHscjx7o5i0/fR1GzvHj+c7Blj\nLAD6pls74mTPmMK4/7lx7P5RcrJnjDEfxqV0dy3bPNern+wPVda4lh/6wTiJkTBmjekjklzLdj8b\nle1H3xnkWt552N61hpRP9keOuwfpXD4hRWIk9qcvUmb0pOv6kgksPD/77hDZIShjyeTBruX9pacl\nRhI+5ZN9Y7O7u5Tdb7DIpi8/fLeuXIQRli8YZej7OVmv+FjXcj2PLA2LfpDaYV0rgR35TfZE9CIR\nlRFRtm5dLyJaT0R52r89dc+tIKJ8ItpPRPPMCjxQK7cfkR2CMvRdz3p2jelgy+ANTYo39P2cLL1v\ngmt5VVaRxEisN6JfN9Pe+xWbl0oP5Mz+ZQDz26xbDmCjECIdwEbtMYhoDIDFAMZqr3mGiKSOhf8o\n+5hr2aj6I61unTnM0PeLdPrCXUYPXOkS48ySCQBwVcZAQ98vNtr9f7Ot4Lih7x3p+B6Fb35/Y4UQ\nnwNo+41ZBOAVbfkVAJfr1r8phKgTQhQAyAcw2aBYI05yd2e1M28rNC9xdNJNIab/o6IqfZfIv/zw\nXNP280V+hWnvHYk41/sW6ulZPyFEibZ8DEDrZI0pAPTtJkXaunaI6EYiyiSizPLy8hDDkEvf9FBS\npX61xtXfmDdKsUecu1lo074y0/YTKQ4f18+LbF7xMhUmyvan6qz7XpIzy8AFJuxrcdFyihL0H1Qh\nxHNCiAwhREZSUpL/F0QgfdvoxzmlEiOxv566m4of56r/WR496Zw5EMymr3ip7z3DPIWa7EuJKBkA\ntH9bT8WKAQzSbTdQW6ek7rqz0Xvf5yngjKKv866qf35m/CQwTqVP9j8+n5O9L6Em+/cBLNWWlwJ4\nT7d+MRF1JqIhANIBbAsvxMilvxHGWDA+O2DPpstIpG/2c/KNfn/8dk8hojcAzADQh4iKANwL4GEA\nbxHR9QAOAbgKAIQQOUT0FoBcAI0AbhVCNJkUO2OM4aGP9skOwRb8JnshxBIfT832sf0DAB4IJyjG\nGGPGckw7hFOna2OMheeJxRNkh2AIxyT7OG7LY4yFQN811s4lox2T7BdNMLZwF2ORLLGLsaPFnSxK\nl+zzyqolRhIepZN9bYP73nC/RJ5G0Cg3TOOqipFuGReWM8zcsf1cy3aeh1bpZF+oq1Kn7xPPwtO7\nW2fZITA/esfz/5FRYnR1oJq5GScy/X
fHUdcy10s3TrRJN7vPH9LLlPd1onm6s9ETDiiZYJWvD1bK\nDiFkSif7dTnuipdTh/WWGIn96ZvEEuPMaQ+eMbKva9nON8L8OVvv/iwH9jTnJER/U3GnA0YkW+Wx\ndQdkhxAypZO9vr5UfGe+YRWOugZ3W+WFw/uYsg9926i+NLVqDuhmPPrDJaNN39+d7+4xfR9OYefJ\nYJRO9lYY2scZk258nOtOvmY1iQ1Lck88ceKMuk0P+lou+vlizVJSxUXXmOLJ/mC5+dOI3TzDPYFJ\no43/6vtzx6rdrmUzS/K2qm1Q97PMPHTCtWz0hDpO1plrVXWIP50w6SfeztL9ErPw3LcmV3YIptno\ngHr9Mswd2192CBGNk32Y9FX2bl+1S2IkzC4qTtfJDkFJZvUSUwUnewMdOa7+bFUsfMUn+XtilMOV\n7hm/ppnUcUAVnOwZY7ZVWeO+Srrk3GSJkUQ+TvaMMdtaqxtLwxOXdIyTPWPMtrKLqyzZz5jkREv2\nYyZO9owpakS/bv43srkt+daUL+ibaP9aQ5zsGVNU52hu1jDK7FF9/W8U4ZRN9vraKhMH95AYCWNy\n2LlCY6SZp+vDb9fxNMom+2bd93x6uvlD0p1izmjrznBULoZmhZH9EmSHoIy+uvkwKqvtOU5C2WR/\nsNw9o8wN3+XJNoxiRamEVifONFi2LxVNSuvpWi49xfVxjHK6ttH/RhFI2WRfrhulmNDF3IlLVK+V\nry/c1Ts+1tR9LZ2a6lpWsRlCXz9pxkhzrzgvHecu5bElv8LUfTnJtoLjskMIibLJ/p1vii3b14M/\nGGfZvmTQj/i8ekpqB1uGL6qT+ytp5yngfKnTHVO8yUXQ9PMONKv3d1OalZlHZIcQEoWTfZFl+xrS\nW+0yx7lHT7mWE02+Sho/qLtr+axuwhRVROnqt5jdIqZvcjtTb8+mB2YcZZO9lRqa1TsD1Xv6k3zX\ncp8Ec5tx9L0eMgvtebkcie55L0d2CKYalqT2CZcRONkboLHJfY2sYg+S/DL3zW6z66/rh7wXVJzp\nYEt7OqQr3HX5hBSJkajltjkjZIcQ8TjZG2B4X/dIxaITXNHQKP/87FvZIRjuyHF3sk9zyCxnVlhw\nDtey94eTvQGsbIdl9qZvEuOmB+NER3Eq84c/IYM9sSFPdggsgu08ctK1bOWYBRVV1/FN52BwsjfY\n21nW9QJizMl2HLZn2QJZONkzxmyppIpHBQeDkz1jzJb0N7uZf5zsGWO29NSmfP8bMZewkj0RFRLR\nHiLaSUSZ2rpeRLSeiPK0f3v6ex/GGIt0f7hktOwQwmLEmf1MIcQEIUSG9ng5gI1CiHQAG7XH0qxY\nMErm7hljiviurlT62Xr7lfIwoxlnEYBXtOVXAFxuwj4CduP0oTJ3z5hUP58xTHYIyuib4J6acP3e\nUomRhCbcZC8AbCCiLCK6UVvXTwhRoi0fA9AvzH0ErVlX4o/7Mhsnlgeu2M4VkwbKDkEZPXXlve14\nczjcQifThBDFRNQXwHoi2qd/UgghiMhrsRjtj8ONADB48OAww/D0dYE1kxA7zaNXnis7BBakYUnu\nUh61DU0etYdY6JpsWDM6rFM1IUSx9m8ZgHcBTAZQSkTJAKD9W+bjtc8JITKEEBlJScZO4qB4EUpp\nRvVPtGQ/d186xpL9yCRjysDPD5Rbvk9V6SehsYuQkz0RxRNRQusygLkAsgG8D2CpttlSAO+FG2Tw\nsVm9RyChi7nVIGU5XeueGnBEv24dbGmcn16YZsl+ZLptTrrl+1RxfgBZ1uwp8b9RhAnnzL4fgC+I\naBeAbQA+EEKsBfAwgIuJKA/AHO2xpbZKmDbs9rkjLd+nFcp00ztadf9Dvx8VS0YDwIXpfSzf5zEe\ncWqYg+U1skMIWsino0KIgwDGe1lfCWB2OEGFa/+xU/43MpiqZ/ayc+2e4iqcO7CH3CBMECeh7fzl\nLw
tx00XcO8eplOxeIaNmxjQJZ2pWkH1mbcczqEDESOjZpFItmfyy067l76TxuM1AKJnsdxdVWb7P\nvgldLN+nFWQ3oqxYvUdyBMY5pbv/wcJT2+C+QdovUc3fPaMpmexls2MfXF/+8anc2aJUuqn42X7u\nDWOU4pPuGeEev6pdazLzgpO9CdblHJMdgmHe3VEsOwRl7JNwL0lV23SdMDpH89iBQHCyN8FrXx+S\nHQKLQH//RL05dZl9cLI3QWGlOs04jEWi+kb7DWqSjZM9Y8x2XuWr56Apneznj+0vOwTGmEKumZIq\nO4SQKZ3so6O44iVjzDiDe3WVHULI1E72nTjZM8aMM3NUX9khhEzpZD99hLHVNBmzoyu5pr1h+nRz\n17Sva7TXGBClk/2Fw9UsYcDsLbW3tU0Bs3Rno3acTi+S9OjqTvaNTbLHlwdHuWSvn1SgR9cYiZGo\n5a2bplq6v+suSHMty67PY7Q5o62dvE1frFSlEcmyNdvse6lcst9b4h6lGNNJucOTZnSytZNtDOjh\nrnfSaMNZgToSZfG9pJQe7iuJ3UUnLd23ygoq7FWkT7lsWHrKXdmPp581TrTFfzgTu7ivyux2BuWP\n1YW70nWTzmwvtH6uB1Vd9vQW2SEERblk/4v/7HAt82TjxomLtbb+iP5+i2K5HlecZ+0NU/28s/+3\ny34zLDFjKJfsZbZJvvGzKdL2bYZmic0n3buqe2Yvs3XRjnOnMmMol+xlmjqst2tZhV+q7z7yibR9\nezbjSAvDFFY3iek1KfCHU7Ub9lbhZG8SFX6p9DXDZapVoAeJ/o+/zDP7E2fsP4FK0Qn39/KuhaMl\nRmIvnOxNknuUa5cbZVVWkewQwrbziLsXjMz66ypUi9Q3610z1b61aqzGyd4ksmd4UokKyf7JTfmy\nQ1DG53kVrmUZc/naFX9SJqlXoM0+UuSXVcsOIWxb8iv8b8QCov/jb/WYBTvjZG+ST3m+UabT1eKu\nq4y1xcmeMWYrdZJv2MfatOnInlEzZjf275wVMfYdOy11/3atwmLTsBmzl9N1jbJDYAbpb3G5C6Mo\nm+x/aPGQdMaYMzzw/XGyQwiJssmeb4jZX0KXaFPe98tvK3Cq1v6Di4Lxq1nDw3p9RXUdsg4dhxAC\nG3JLPUqJ+3tdpmLF1y7QjZTfuLdUYiTBUTbZR1qXrLJTtfjP1sMhvfajPSU4UOq7nXJtdgn2B9CO\n+d7OYhRU1OClLQWoCmIkZaJJSdeflB5xhr9n1dkG/Pj5rbj51Sy/25afrsPv3tqFzXn271l1+cQU\nr+v3lpzCxznHfL7u9a2HsD63FBn3b8AP//EV/vR/ubjh35lY/NxXAe034/4NuOKfgW1rR9e/kolP\n9pXJDiMgcn6LTVBZXYfH1x+QHUY7z3yaj20Fx11dMRuamrFUm5jjUGUNXtpSCAC4ftoQDNJNZrzv\n2Cm8v/ModhdV4Qutj3bhw5dgfW4pKqrrsGTyYNe2N7/2jet5vVVZRegaG4WF45KxKqsIt7+9y/Xc\n1wcrMaBkRPgdAAAN2klEQVRHHH5ywRAcO1WLe9/Pwdwx/XCmvhF3LhyN/92Q59r2T4vGGveBBCE2\n2n0uIoTwqGJaWFGDG/6diWumpLo+z1aNTc0YftdHAICbLhqKH0wciJH9W+rxN2jjH/YdO43nPz+I\nbl2isTb7GEb2T8CKBaM89nHr699gW+FxvPNNEQofvgT/2nwQYwYkAgB+/PxW7LznYnywpwR/+G82\n8h9Y6PUEI+vQCXz1rfw+9vrvlhACd76bjQOlp5F16AQAIOdP83D/B7m4c+FoJGh1iQ5V1uCud7M9\n3uflLwsBANsLT4QVz9Ob8jB1WG9MSu3l9fn3dhbj12/uxL775uPj3FKsyzmGC4f1wY/PH+x1eyu1\nraa7Oa8CmYeO4/a5I9s9V9fYhJF/WIt/Xj0J88/pb2WY7VAkFBXK
yMgQmZmZIb32ltezkJHaC9lH\nq7D6m2LX+uevzcDFY6ydEQgA0pZ/4Fq+59Ix+POa3HbbtCblBU9s9phsBQD+etV4/PatXe1e480t\nM4bhy28rXUPxCx++BMtW7UZOSRWyiwMr1zA0KR4Hyz0nYdj8+5keRdD23Tffo0yuVdZmH8PNr7Wc\ngV80IgmfHShHr/hYzBiZ5PF/DbRMLr/ypqmYlNoTn+4vw3Uvbfd4fmhSPJbNH4WbAjijv2n6UFw4\nvA+ufXGba12XmE6obfAcKBcTRWjQpqa7dmoqfj9/FC576gsc1Ca1uHrKYLz2dfurubZ/lK3Q1Cww\n7M4PAbTMo1pRXe/x/OjkRI/v4ryx/ZBZeAKVNZ7bebPvvvkYdfdaAMCtM4fho+xj7b5Tg3rF4Wff\nHYp73svxWL9iwShkHz2F+y8/B+P/9LHfffWKj8VxXUwyPkvA8/dc74WlGbj+Fd+57JezhmPH4ZOu\nE7jXrj8f09JDnz6ViLKEEBkBbWv3ZO/rQz/44EJ0ktCU4yseO5P1WWYWHleyCUBGghJCYMiKDy3f\nr9kiLdmH4sklE3HZ+AEhvTaYZG/rNvvK6jqfz8lIToCavYBkfZZJCZ2l7NdMf5bUJKbiRD5xEq42\nzfCrN3b438gAtk72x3RTEEaKmCj1fqlkSe0dLzsEww3po94xyXL+UO/t/cw705I9Ec0nov1ElE9E\ny83YRwS0QLWj2uTYzFhjkhNlh6CMGSOSZIdgK6YkeyKKAvB3AAsAjAGwhIjGGL2f07WRNyoxEv8A\nscjRKz5WdgjKmDC4p+wQbMWsM/vJAPKFEAeFEPUA3gSwyOidPLEx8rpa/nbuCNkhsAimYtu5LN06\nq9FmbxWzkn0KgCO6x0XaOkMFOorPSt06KzN0gbGIJnPGLzuSdoOWiG4kokwiyiwvD22EYt+EyCtI\n1D0uxv9GjLGw6QeKMf/MSvbFAAbpHg/U1rkIIZ4TQmQIITKSkkK70dIznhMrY4wFwqxkvx1AOhEN\nIaJYAIsBvG/0Tvp0U68fNmMs8o3o1012CEEzJdkLIRoB/ALAOgB7AbwlhMjp+FXBWzo1zei3ZIwx\nvy4ZF9qIV5lMa7MXQnwohBghhBgmhHjAjH309NGNLYFvkhrmkR+eKzsEZpCrMtQb3S3LpFT7dfu0\n9QhaX1YsHC07BGVcNsF+ZzDMu3u/J6dUg4r6JBg3XsKqgXa2T/b775/v8XjhuP5YMnmQj62t8evZ\n6V7X3zFvpMWRuK27bbrHhC76AlLThrdU3fNWJVT2vAC3zhxm2nuPHdDxL9njV45vt+7tm6cGvZ+J\ng3sAAB69Qu5VUryPK96+HdQgyv3zvA7bp/19hqoa1T8xoHLLGQFcAay+5QIjQvLL9lUvAeB0bQPq\nG5tRXdcYEfVUhBA4cvwsBvWKw7qcY65681l/mINJ929AXEwUzjY0uUoJ3zJjGH70nUHo1jkasdGd\ncKjyDFJ6xKFnfCwOlldj1uOfed3Pht9Ox4Aecag4XY+HPtqLj7Ldk1Bcd0Ea+nSLxWMfH8B5g3tg\n9S0XoqauEXll1RiX0h1RnQg1dY0429CEnl1jcfTkWQzq1RXHa+oRHUXoHN0Jx2vqkdzd+AlEgtHU\nLHDJk5s9Jpm+bU46Sk7WYmWmeyjHU0sm4uIx/XDiTD0251Xg3IHdUXKyFlVnG1B88iweXbe/3Xvn\nP7AAH2Yfw/Ckbvjy2wpMHtILKT3iUFJVi/jO0RjSJ96juuFLP/kOZo7si+ZmgaF3uitIvrA0A2MG\nJKJZAE1NAtMf/QTjUrrj9Z+dj6ozDRjUqysOV57BoF5x0gdVnaipx/0f7MU73xThugvScNucdBw5\nfhbfe/oL1zYj+nXDgdJq/GLmcNw+byRqG5pQdbYB5z+4EQDw5o1TsPi5r/HYleNxxaSBeGpjnmsu\niZsuGopnPzvoN46hfeLxi1nD
Pcp5b/79TMTFRuHnr2Vhe+EJXHdBGpZMHoyfvrwdxSfPIqoTucbW\n7Lp3bkR0c25qFjh68qyrJPhz10zCja9m4fsTU/DujmLcdNFQLJs3Cnll1cg5WuU63u5xMag62zKB\nUDiVO4OpegkhhPSfSZMmCVUdrqwRqcvWiCkPbhBCCNHU1Cyam5tFU1Ozx+OONDY1u34e+nCvmPno\nJ67Xt2pubhYNjU2iscn93tsKKkXqsjXiB89sMeHIrNPc3CxSl61x/ZyoqXN9hnMe/1SkLlsj9pWc\n6vD1jdrn3Pq6tp+fL637bLv9S18cFGPu/sjr+wTyfypTdvFJkbpsjSisqHata2pqFjsPnxCpy9aI\nw5U1rs9Lb/GzX4nfrNzh2r5V28+09XOe+dgnInXZGvHylgLR2NQsHl+3z/V5frD7qCg5eVakLlsj\nthdUikbd++UUV4nUZWtEQbk7vvrGJnGmrlGkLlsjRtz1oSmfSzhaj0uIls/mYHm1SF22RuQerfLY\n7qJHNokHP8x1fScD/R76AiBTBJhnlTizj2RNzQI3v5aFmy8aZvlNnaxDJ/DDf3yJacP74LUbzrd0\n30b77cqduHR8MmaN8mxq2l10En9bfwDPXZuBmCjjWyUf+nAvhiV1w1Xfkds0qIrWK6VQzmaFEPjF\nf3bgf6YMxgXDQp/wwwxvbDuMwsoarFhg7f1CR01ewnxrbhb424YDuGZKKvomRt5oY+Y8m/aVor6x\nGfPPSZYdihKCSfbcR1FhnToRfjdX3k1hxtpqe2XGrGP73jiMMcb842TPGGMOwMmeMcYcgJM9Y4w5\nACd7xhhzAE72jDHmAJzsGWPMATjZM8aYA0TECFoiKgdwKIy36AOgwqBw7MBpxwvwMTsFH3NwUoUQ\nAc3rGhHJPlxElBnokGEVOO14AT5mp+BjNg834zDGmANwsmeMMQdQJdk/JzsAiznteAE+ZqfgYzaJ\nEm32jDHGOqbKmT1jjLEO2DrZE9F8ItpPRPlEtFx2PMEgokFE9AkR5RJRDhH9Wlvfi4jWE1Ge9m9P\n3WtWaMe6n4jm6dZPIqI92nNPkjbRKRF1JqKV2vqtRJRm9XF6Q0RRRLSDiNZoj5U+ZiLqQUSriGgf\nEe0loqkOOObfaN/rbCJ6g4i6qHbMRPQiEZURUbZunSXHSERLtX3kEdHSgAIOdP7CSPsBEAXgWwBD\nAcQC2AVgjOy4gog/GcB52nICgAMAxgB4BMBybf1yAH/Rlsdox9gZwBDt2KO057YBmAKAAHwEYIG2\n/hYA/9SWFwNYKfu4tVh+C+A/ANZoj5U+ZgCvALhBW44F0EPlYwaQAqAAQJz2+C0A16l2zACmAzgP\nQLZunenHCKAXgIPavz215Z5+45X9ixDGBz0VwDrd4xUAVsiOK4zjeQ/AxQD2A0jW1iUD2O/t+ACs\n0z6DZAD7dOuXAHhWv422HI2WgRsk+TgHAtgIYBbcyV7ZYwbQHS2Jj9qsV/mYUwAc0ZJRNIA1AOaq\neMwA0uCZ7E0/Rv022nPPAljiL1Y7N+O0fqFaFWnrbEe7PJsIYCuAfkKIEu2pYwBa53Hzdbwp2nLb\n9R6vEUI0AqgC0NvwAwjO/wL4PYBm3TqVj3kIgHIAL2lNV/8iongofMxCiGIAjwE4DKAEQJUQ4mMo\nfMw6VhxjSLnPzsleCUTUDcA7AG4TQpzSPyda/mwr012KiC4FUCaEyPK1jWrHjJYzsvMA/EMIMRFA\nDVou711UO2atnXoRWv7QDQAQT0RX67dR7Zi9ibRjtHOyLwYwSPd4oLbONogoBi2J/nUhxGptdSkR\nJWvPJwMo09b7Ot5ibbnteo/XEFE0WpoUKo0/koBdCOAyIioE8CaAWUT0GtQ+5iIARUKIrdrjVWhJ\n/iof8xwABUKIciFEA4DVAC6A2sfcyopjDCn32TnZbweQTkRDiCgWLTcw3pccU8C0O+4vANgrhP
ir\n7qn3AbTeXV+Klrb81vWLtTv0QwCkA9imXTKeIqIp2nte2+Y1re91BYBN2tmGFEKIFUKIgUKINLT8\nf20SQlwNtY/5GIAjRDRSWzUbQC4UPma0NN9MIaKuWqyzAeyF2sfcyopjXAdgLhH11K6i5mrrOmb1\nDQ2Db44sREsvlm8B3CU7niBjn4aWS7zdAHZqPwvR0ia3EUAegA0Aeulec5d2rPuh3bHX1mcAyNae\nexruwXJdALwNIB8td/yHyj5uXcwz4L5Bq/QxA5gAIFP7v/4vWnpQqH7MfwKwT4v3VbT0QlHqmAG8\ngZZ7Eg1ouYK73qpjBPBTbX0+gJ8EEi+PoGWMMQewczMOY4yxAHGyZ4wxB+BkzxhjDsDJnjHGHICT\nPWOMOQAne8YYcwBO9owx5gCc7BljzAH+Hwbqm2M7x3+uAAAAAElFTkSuQmCC\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Hacking up a noisy pulse train\n", "%matplotlib inline\n", "from matplotlib import pyplot as plt\n", "\n", "n = 100000\n", "noise = np.random.normal(size=n) * 3\n", "pulses = np.maximum(np.sin(np.arange(n) / (n / 23)) - 0.3, 0.0)\n", "waveform = ((pulses * 300) + noise).astype(np.int16)\n", "plt.plot(waveform)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now try filling in body of this ufunc:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": { "collapsed": true }, "outputs": [], "source": [ "@vectorize(['int16(int16, int16)'], target='cuda')\n", "def zero_suppress(waveform_value, threshold):\n", " ### Replace this implementation with yours\n", " result = waveform_value\n", " ###\n", " return result" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"…base64 PNG data elided (plot of the zero-suppressed waveform)…", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# the noise on the baseline should disappear when zero_suppress is implemented\n", "plt.plot(zero_suppress(waveform, 15.0))" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: 3 - Memory Management.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GTC 2017 Numba Tutorial Notebook 3: Memory Management\n", "\n", "## Managing GPU Memory\n", "\n", "During the benchmarking in the previous notebook, we used NumPy arrays on the CPU as inputs and outputs. If you want to reduce the impact of host-to-device/device-to-host bandwidth, it is best to copy data to the GPU explicitly and leave it there to amortize the cost over multiple function calls. 
In addition, allocating device memory can be relatively slow, so allocating GPU arrays once and refilling them with data from the host can also be a performance improvement.\n", "\n", "Let's create our example addition ufunc again:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numba import vectorize\n", "import numpy as np\n", "\n", "@vectorize(['float32(float32, float32)'], target='cuda')\n", "def add_ufunc(x, y):\n", " return x + y" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": true }, "outputs": [], "source": [ "n = 100000\n", "x = np.arange(n).astype(np.float32)\n", "y = 2 * x" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 111.70 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "1000 loops, best of 3: 1.34 ms per loop\n" ] } ], "source": [ "%timeit add_ufunc(x, y) # Baseline performance with host arrays" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `numba.cuda` module includes a function that will copy host data to the GPU and return a CUDA device array:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "\n", "(100000,)\n", "float32\n" ] } ], "source": [ "from numba import cuda\n", "\n", "x_device = cuda.to_device(x)\n", "y_device = cuda.to_device(y)\n", "\n", "print(x_device)\n", "print(x_device.shape)\n", "print(x_device.dtype)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Device arrays can be passed to CUDA functions just like NumPy arrays, but without the copy overhead:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 loops, best of 3: 429 µs per loop\n" ] } ], "source": [ "%timeit 
add_ufunc(x_device, y_device)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a big performance improvement already, but we are still allocating a device array for the output of the ufunc and copying it back to the host. We can create the output buffer with the `numba.cuda.device_array()` function:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "out_device = cuda.device_array(shape=(n,), dtype=np.float32) # does not initialize the contents, like np.empty()" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "And then we can use a special `out` keyword argument to the ufunc to specify the output buffer:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "1000 loops, best of 3: 235 µs per loop\n" ] } ], "source": [ "%timeit add_ufunc(x_device, y_device, out=out_device)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have removed the device allocation and copy steps, the computation runs *much* faster than before. When we want to bring the device array back to the host memory, we can use the `copy_to_host()` method:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0. 3. 6. 9. 12. 15. 18. 21. 24. 
27.]\n" ] } ], "source": [ "out_host = out_device.copy_to_host()\n", "print(out_host[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Exercise\n", "\n", "Given these ufuncs:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [], "source": [ "import math\n", "\n", "@vectorize(['float32(float32, float32, float32)'], target='cuda')\n", "def make_pulses(i, period, amplitude):\n", " return max(math.sin(i / period) - 0.3, 0.0) * amplitude\n", "\n", "n = 100000\n", "noise = (np.random.normal(size=n) * 3).astype(np.float32)\n", "t = np.arange(n, dtype=np.float32)\n", "period = n / 23" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Convert this code to use device allocations so that there are host<->device copies only at the beginning and end, and benchmark the performance change:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [], "source": [ "pulses = make_pulses(t, period, 100.0)\n", "waveform = add_ufunc(pulses, noise)" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": 
"…base64 PNG data elided (output plot of the exercise cell)…
TzcFvXga41e\nQTX58Df3VMzj310d0rkvetX9Pk1sWhd1alnj019Fn3+FNjX0kbIGdLqkggU90lDzN/5YRWzNYMMB\n33O/qzNOaBI4KwmlLOxri9UXOINh4Xb1h4JWbv14nabtzBQgLZ4ZdubjSfYvMelrnUVP5L4LDEHS\nHLoWYEWlJDMxI0e/U0i2eHpQD8y43VnQg6SZSV4LVvHJ7Smmnat1ozqBN7IxjeryAFgv+OGoDRb0\nIBnUqZnVJhjKiC7mJEACgKgouX+kj13axWoTmAjCjPEoC3qQ/PPKblabYCjcE9IP2ee1meAwI40E\n33FBwoLHMEwozPITza0XLOiMpdSK5gckE3kYkfa4MhS3riBhQWcsJaV9U6tNYBgv4lQC9sKlZiS4\nUdhe0M3M41KFZ+5pJnSmju1qtQnS0NzEiGnZad1Yfw+sYp5DD8z39w02/ZwvTuhh+jllJdaCBzLD\nyIrtf03xjesG3khnOraob/o5ZcXqcG6GMYvcQuNrk9pe0Bl7w0uiTCTSyaadtrAEnYguI6LdRJRB\nRFP1MiooG6w4KcNEIEYk6HIqj1xiz6CwkAWdiKIBvAtgLIBuAG4iIrmjbiTn/hGdTD+nrG799w03\n/7t85JLzTD+nrETbNIo5nB76AAAZQohMIUQZgG8AjNfHrMiGJB0XNKlrfp4aK9ZAzCCpmflD9m5t\nvAtXMM4iHEFPAFAz9ClLaZOeugYEHTgVWb1cxvWKt9oEJgzs+hs3/NdERJOJKJWIUnNzjSn+wOjD\n5SxCulGXvXdsjdGFKIwiHEE/DCCxxvu2SpsbQojpQogUIURKixYtwjidOvVq8w9HLzwLYDOhI3sm\nSTOx63y2FYQj6OsBJBNRByKKBXAjgHn6mKWd2jEs6AwjM89ewb4WWgl5XCGEqCCiBwD8AiAawCdC\niB26WcYwDAN5PaGMIKyJIiHEzwB+DrghwzBMiHRtzd47WpHTxYBhGGkY0MGajJyrp4605LzhYGtB\nH9RR7nJwTqGNAZntnEq3eO7N6oXeTgIp7eN0PZ4athb0p8adb7UJjA7cMbSD1SZIw5RRyVabwPjA\nDG8dWwu6le5M4/u0sezcsmFFVKWsXNajtdUmMD4wY3HX1oIeZeHyd8M69gw8iEQGdeKpM0Z+zEgZ\nYnNBt+7cHZo3sO7kksGBI4wTSIgzPnDP1oLeuF4tq02QhlaNrCtfxkUuGCcw+vxWhp/D1oLesiF7\nR+jFw6M59apeNKtvftZKxhj0FOF6JiT8srWgW4kQwmoTmAiFOLRRGrrFN9TtWLwoqqCW+WzOvYMs\nsOQcsi2Kto2rZ7UJ0sBLAvIwomtLq00IClsIeuO63nPlCU2sFaBr+ycG3shGtG/mPEFv3sCYqREr\nva8A4K0b+1h6fpno2y4Oaf+6TNO2kZDb33oLNKD2+2hm0I9RK8F6Zvx4/xCfn13eM3zf4aqCCrWi\nve3qrqGSTWJTeQX9yt7qMQM9Exobcj6rZ1zG97FHnZk/bBJaX1ujUI/r6buewHmtGqBfO44UBQB8\n+pcBuPsi92jCmAga1z55edeA2/RJbKL6pO/YvD7eu6V/WOffP20c3r25H/ZPG4dRXV2LOO/f0q/6\n8wVTLsLF5+mfi95orugVj0Z1YrD2iVFen03sq120ru7rEnTPxcqque5Jg9pj/7RxmPeA74euGs9f\n1b369Zd3Xlj92qgHhdVMGdlZ03bLHx2OFyb0wAMjfG8/+vxWaNOkLjo0PxdU1rttZH5v/nLbX9Ov\nbfXruHqx+PXvF6tut/jhi1HfhKIZthD0zi0b4Klx3fDDfYOr2yJh4SmhSV1cfF4LTB7Wye/Nu3/a\nOADBl7Va9+Qo/PbY8Or3m565pPr1xH7B9cI+npSCXS9oGzpGCu/c3A9bnxuD1h65Xq7t3xaDOzf3\nu2/NSN6RXVth/7RxuLrGQ+DBkZ2rI1SvUrb1l
dUv48Wx+GbyQLe2bycPxKTBSYiNdv2EhiafsycS\nojXvubij2/v908bhpgHtVLf11VP+8s4L8c3kgWhQOwaz/zoIdw3rqLpdTSYP64gOzevjtoHt8eiY\nLvjpgaFu611X9IrHqn+MwPu3ujocyx8dXj2CsioJVzjcNqh99esbBySiYwvv+JSmJno92Wplr2+7\nOMy5dzCy8s9YbQoA92xsAzs2wzvLMwAA/7isK+4d3gnL0rK9iiB3alEfe3OLqt97zl1P7JuAa/q3\nRVFpBVo1Oidk0VGEuPqxmP/gUFzx9ircOrA9vt/oVSAKAureNzHRUfBVC2SYzXrvFyTFuXkZrXh0\nOIa/tgIAMP/Boeih9JDnbj6iun+PhEZ4cGQyBASGJjdD//YuIYmNicKsewbhwIkixMZE4aFvNgNw\nfXcDPRLBXai8X//0aJSUnwUAvHZdbzz63Rac10o/z4hQeWLs+bg4uQVunrGuuu3liT3x8sSeSJq6\nwG3bFg3dYxBuSEnE2J6tqx9S258fU/3Z2zf1xYNfbwIA/PbYcBQUV+D+mRtxMM/1m/ScVuip9Lp/\nemAoVmUcx73DO3nZes+wjliZnosbB7TDRyv3IbFpXRzKKw710g3ltet64+jJYvRKbIKEJnXQueW5\nv7Xa3/23x4ab6l5tK0EHgP7t49DfhKxlwTI0uTn+Z1B7fL7mQHXbKBUf1hFdWmJv7j4AwCe3pyAl\nyb1X8tp1vb2GeF/fPRCJTV0Phh4Jjat7/P4INIC5Y0gHHMwrwtK0nIDHMpP1T43GxoP5mLnuoOrn\na54YidaN6mD2hiwAriFvUo1hew8N0x3jeydUL2CN7Or+NxrQoWl1T/GRWVsCHqtx3VrVi/bX9m+L\nkV1bmtoj80egUczPUy7CiaJS1IqOQurTo5Hy76UAXGUdh3dR9+64snebakFvr4xwfn98RPVDwlcv\nu2fbxtXi7kmPhMbY/OylAM6NZvOLytD3hSV+7beCa/u3DbxRDRrWqWVqwWnbCXok07ttEwAH0LGF\n72RTbWqk5PQUE0B9vs6IXCfPXtkNK3bnYGlaTkT51LdoWBtjurfGmO7q0xaeIx5//OeG3mjXNPTE\nXztq9ExrsuThYT73iRQxr+K2ge1xsrhc9bNuNRbLmzeojXdu7osHZm4KaYSx+OFhqBcbrdv1xynH\naRQh7sF1akWhpLxS9bPR57d06xitfHwELnp1uVmmuREZ35YkTOyXgC6tG/rtJd4+OAlfrD2Al67u\nGfb5frhvMPafKHJrC6TNvRObIL+oDEBkrENopV5sNM6Una1+73mZl/dsje5t3L/3q/u696YmD+uI\nnUcLNPeyPFMSzL1/CH7fk4vkCJhS0coLE3p4ta36x4jquf+ajOsZj1Z/rRNS3m4jppneurEP+iZG\nxmh80zOX+pzOfP/W/iguP3dvJjath7h6tZB/Rv1BaiQs6DpCRAGH/FFRhOWPDvdq3/jMJSirUO8B\n+KJvuzj09ZizrK2IUHSU+nr33Bruk1WeQpGQS2XFo8NRftb39a99chTKa3w/VYUchp3nmlbQ4inU\nslEdzLx7YMDtfNE7sQl6JzYJef9IwVcQGRHhgiRtC5NqsSF6E0nul/6mTWpFR6GWxwPy5Yk98fLC\nXaaPMMjM4XZKSopITU017XxOJL+oDNNXZuLRS7sgPacQa/eewO1D1AtIVFYKvLksHZMGtUezBtYl\n5wqVgpJyNKrDCdrMJiOnEHH1Ym15z9gVItoghEgJuB0LOsMwTGSjVdBt4YfOMAzDBIYFnWEYRhJY\n0BmGYSSBBZ1hGEYSWNAZhmEkgQWdYRhGEljQGYZhJIEFnWEYRhJMDSwiolwABwJuqE5zAMd1NMcO\n8DU7A75mZxDONbcXQgTMc22qoIcDEaVqiZSSCb5mZ8DX7AzMuGaecmEYhpEEFnSGYRhJsJOgT7fa\nAAvga3YGf
M3OwPBrts0cOsMwDOMfO/XQGYZhGD/YQtCJ6DIi2k1EGUQ01Wp7goGIEoloORHtJKId\nRPSQ0t6UiJYQUbryf1yNfZ5QrnU3EY2p0d6fiLYpn/0fKTXkiKg2EX2rtK8joiSzr9MTIoomok1E\nNF95L/X1AgARNSGi2US0i4jSiGiQzNdNRA8r9/R2IvqaiOrIeL1E9AkR5RDR9hptplwnEU1SzpFO\nRJMCGiuEiOh/AKIB7AXQEUAsgC0AulltVxD2xwPop7xuCGAPgG4AXgUwVWmfCuAV5XU35RprA+ig\nXHu08tmfAAYCIAALAYxV2u8D8IHy+kYA30bAdT8CYCaA+cp7qa9XseUzAHcpr2MBNJH1ugEkANgH\noK7yfhaA22W8XgDDAPQDsL1Gm+HXCaApgEzl/zjldZxfW63+EWj4MgcB+KXG+ycAPGG1XWFcz1wA\nlwDYDSBeaYsHsFvt+gD8onwH8QB21Wi/CcCHNbdRXsfAFbxAFl5jWwDLAIzEOUGX9noVOxrDJXDk\n0S7ldcMl6IcUsYkBMB/ApRJfbxLcBd3w66y5jfLZhwBu8menHaZcqm6cKrKUNtuhDKX6AlgHoJUQ\n4qjy0TEArZTXvq43QXnt2e62jxCiAsApAM10vwDtvAngcQA1qz7LfL2AqzeWC+C/ylTTDCKqD0mv\nWwhxGMBrAA4COArglBBiMSS9XhXMuM6gtc8Ogi4FRNQAwBwAfxNCFNT8TLgev1K4GxHRFQByhBAb\nfG0j0/XWIAauYfn7Qoi+AIrgGopXI9N1K3PG4+F6kLUBUJ+Ibq25jUzX649Iuk47CPphAIk13rdV\n2mwDEdWCS8y/EkJ8rzRnE1G88nk8gByl3df1HlZee7a77UNEMXAN/0/ofyWaGALgKiLaD+AbACOJ\n6EvIe71VZAHIEkKsU97PhkvgZb3u0QD2CSFyhRDlAL4HMBjyXq8nZlxn0NpnB0FfDyCZiDoQUSxc\niwbzLLZJM8pK9scA0oQQb9T4aB6AqlXrSXDNrVe136isfHcAkAzgT2V4V0BEA5Vj/o/HPlXHuhbA\nr0qvwXSEEE8IIdoKIZLg+lv9KoS4FZJebxVCiGMADhFRF6VpFICdkPe6DwIYSET1FDtHAUiDvNfr\niRnX+QuAS4koThkRXaq0+caKBYYQFiQuh8s7ZC+Ap6y2J0jbh8I1HNsKYLPy73K45siWAUgHsBRA\n0xr7PKVc624oK+FKewqA7cpn7+BcYFgdAN8ByIBrJb2j1det2DUc5xZFnXC9fQCkKn/rH+HyTJD2\nugE8D2CXYusXcHl2SHe9AL6Ga52gHK6R2J1mXSeAO5T2DAB/CWQrR4oyDMNIgh2mXBiGYRgNsKAz\nDMNIAgs6wzCMJLCgMwzDSAILOsMwjCSwoDMMw0gCCzrDMIwksKAzDMNIwv8DX9Y35IpeRYwAAAAA\nSUVORK5CYII=\n", "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "%matplotlib inline\n", "from matplotlib import pyplot as plt\n", "plt.plot(waveform)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: 4 - Writing CUDA Kernels.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GTC 2017 Numba Tutorial Notebook 4: Writing CUDA Kernels\n", "\n", "## The CUDA Programming Model\n", "\n", "Ufuncs (and generalized ufuncs mentioned in the bonus notebook at the end of the tutorial) are the easiest way in Numba to use the GPU, and present an abstraction that requires minimal understanding of the CUDA programming model. However, not all functions can be written as ufuncs. Many problems require greater flexibility, in which case you want to write a *CUDA kernel*, the topic of this notebook. \n", "\n", "Fully explaining the CUDA programming model is beyond the scope of this tutorial. We highly recommend that everyone writing CUDA kernels with Numba take the time to read Chapters 1 and 2 of the CUDA C Programming Guide:\n", "\n", " * Introduction: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#introduction\n", " * Programming Model: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model\n", "\n", "The programming model chapter gets a little into C specifics, but familiarity with CUDA C can help you write better CUDA kernels in Python.\n", "\n", "For the purposes of this tutorial, the most important thing is to understand this diagram:\n", "![Thread Hierarchy](http://docs.nvidia.com/cuda/cuda-c-programming-guide/graphics/grid-of-thread-blocks.png \"Thread Hierarchy (from CUDA C Programming Guide)\")\n", "\n", "We will be writing a *kernel* that describes the execution of a single thread in this hierarchy. The CUDA compiler and driver will execute our kernel across a *thread grid* that is divided into *blocks* of threads. 
Threads within the same block can exchange data very easily during the execution of a kernel, whereas threads in different blocks should generally not communicate with each other (with a few exceptions).\n", "\n", "Deciding the best size for the CUDA thread grid is a complex problem (and depends on both the algorithm and the specific GPU compute capability), but here are some very rough heuristics that we follow:\n", "\n", " * the size of a block should be a multiple of 32 threads, with typical block sizes between 128 and 512 threads per block.\n", " * the size of the grid should ensure the full GPU is utilized where possible. Launching a grid where the number of blocks is 2x-4x the number of \"multiprocessors\" on the GPU is a good starting place. Something in the range of 20 - 100 blocks is usually a good starting point.\n", " * The CUDA kernel launch overhead does depend on the number of blocks, so we find it best not to launch a grid where the number of threads equals the number of input elements when the input size is very big. We'll show a pattern for dealing with large inputs below.\n", "\n", "Each thread distinguishes itself from the other threads using its unique thread (`threadIdx`) and block (`blockIdx`) index values, which can be multidimensional if launched that way." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## A First Example\n", "\n", "This all will be a little overwhelming at first, so let's start with a concrete example. Let's write our addition function for 1D NumPy arrays. 
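To make the index arithmetic concrete before involving the GPU at all, here is a plain-Python sketch of how every thread in a grid computes its own start offset and stride, and how those strided subsets cover the whole array. The block and grid sizes here are made up for illustration, and the loops stand in for the CUDA thread/block indices:

```python
import numpy as np

# Plain-Python sketch of the thread indexing scheme -- no GPU required.
# The two outer loops stand in for cuda.blockIdx.x and cuda.threadIdx.x;
# the sizes are hypothetical, chosen small for illustration.
block_size = 4   # threads per block (cuda.blockDim.x)
grid_size = 3    # blocks in the grid (cuda.gridDim.x)

x = np.arange(20, dtype=np.float32)
y = 2 * x
out = np.empty_like(x)

for block_idx in range(grid_size):        # cuda.blockIdx.x
    for thread_idx in range(block_size):  # cuda.threadIdx.x
        start = thread_idx + block_idx * block_size
        stride = block_size * grid_size   # total number of threads
        # each simulated "thread" handles elements start, start+stride, ...
        for i in range(start, x.shape[0], stride):
            out[i] = x[i] + y[i]

assert np.allclose(out, 3 * x)  # every element was covered
```

Each simulated thread touches a disjoint, strided subset of the elements; the real kernel below performs exactly this arithmetic, but with all the threads running in parallel.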
CUDA kernels are compiled using the `numba.cuda.jit` decorator (not to be confused with the `numba.jit` decorator for the CPU):" ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numba import cuda\n", "\n", "@cuda.jit\n", "def add_kernel(x, y, out):\n", " tx = cuda.threadIdx.x # this is the unique thread ID within a 1D block\n", " ty = cuda.blockIdx.x # Similarly, this is the unique block ID within the 1D grid\n", "\n", " block_size = cuda.blockDim.x # number of threads per block\n", " grid_size = cuda.gridDim.x # number of blocks in the grid\n", " \n", " start = tx + ty * block_size\n", " stride = block_size * grid_size\n", "\n", " # assuming x and y inputs are same length\n", " for i in range(start, x.shape[0], stride):\n", " out[i] = x[i] + y[i]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a lot more typing than our ufunc example, and it is much more limited: it only works on 1D arrays, doesn't verify that input sizes match, etc. Most of the function is spent figuring out how to turn the block and grid indices and dimensions into unique offsets into the input arrays. The pattern of computing a starting index and a stride is a common way to ensure that your grid size is independent of the input size. The striding will maximize bandwidth by ensuring that threads with consecutive indices are accessing consecutive memory locations as much as possible. Thread indices beyond the length of the input (`x.shape[0]`, since `x` is a NumPy array) automatically skip over the for loop.\n", "\n", "Also note that we did not need to specify a type signature for the CUDA kernel. Unlike `@vectorize`, Numba can infer the type signature from the inputs automatically, and much more reliably.\n", "\n", "Let's call the function now on some data:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "scrolled": false }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[ 0. 
3. 6. 9. 12. 15. 18. 21. 24. 27.]\n" ] } ], "source": [ "import numpy as np\n", "\n", "n = 100000\n", "x = np.arange(n).astype(np.float32)\n", "y = 2 * x\n", "out = np.empty_like(x)\n", "\n", "threads_per_block = 128\n", "blocks_per_grid = 30\n", "\n", "add_kernel[blocks_per_grid, threads_per_block](x, y, out)\n", "print(out[:10])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The unusual syntax for calling the kernel function is designed to mimic the CUDA Runtime API in C, where the above call would look like:\n", "```\n", "add_kernel<<<blocks_per_grid, threads_per_block>>>(x, y, out)\n", "```\n", "The arguments within the square brackets define the size and shape of the thread grid, and the arguments within parentheses correspond to the kernel function arguments.\n", "\n", "Note that, unlike the ufunc, the arguments are passed to the kernel as full NumPy arrays. The kernel can access any element in the array it wants, regardless of its position in the thread grid. This is why CUDA kernels are significantly more powerful than ufuncs. (But with great power comes a greater amount of typing...)\n", "\n", "Numba includes [several helper functions](http://numba.pydata.org/numba-doc/dev/cuda/kernels.html#absolute-positions) to simplify the thread offset calculations above. You can write the function much more simply as:" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": true }, "outputs": [], "source": [ "@cuda.jit\n", "def add_kernel(x, y, out):\n", " start = cuda.grid(1) # 1 = one dimensional thread grid, returns a single value\n", " stride = cuda.gridsize(1) # ditto\n", "\n", " # assuming x and y inputs are same length\n", " for i in range(start, x.shape[0], stride):\n", " out[i] = x[i] + y[i]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As before, using NumPy arrays forces Numba to allocate GPU memory, copy the arguments to the GPU, run the kernel, then copy the argument arrays back to the host. 
This is not very efficient, so you will often want to allocate device arrays:" ] }, { "cell_type": "code", "execution_count": 33, "metadata": { "collapsed": true }, "outputs": [], "source": [ "x_device = cuda.to_device(x)\n", "y_device = cuda.to_device(y)\n", "out_device = cuda.device_array_like(x)" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 44.29 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "100 loops, best of 3: 2.41 ms per loop\n" ] } ], "source": [ "%timeit add_kernel[blocks_per_grid, threads_per_block](x, y, out)" ] }, { "cell_type": "code", "execution_count": 35, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 203.24 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "1000 loops, best of 3: 492 µs per loop\n" ] } ], "source": [ "%timeit add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device); out_device.copy_to_host()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Kernel Synchronization\n", "\n", "*One extremely important caveat should be mentioned here*: CUDA kernel execution is designed to be asynchronous with respect to the host program. This means that the kernel launch (`add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device)`) returns immediately, allowing the CPU to continue executing while the GPU works in the background. Only host<->device memory copies or an explicit synchronization call will force the CPU to wait until previously queued CUDA kernels are complete.\n", "\n", "When you pass host NumPy arrays to a CUDA kernel, Numba has to synchronize on your behalf, but if you pass device arrays, processing will continue. 
If you launch multiple kernels in sequence without any synchronization in between, they will be queued up to run sequentially by the driver, which is usually what you want. If you want to run multiple kernels on the GPU in parallel (sometimes a good idea, but beware of race conditions!), take a look at [CUDA streams](http://numba.pydata.org/numba-doc/dev/cuda-reference/host.html?highlight=synchronize#stream-management).\n", "\n", "Here's some sample timings (using `%time`, which only runs the statement once to ensure our measurement isn't affected by the finite depth of the CUDA kernel queue):" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 2.53 ms, sys: 1.11 ms, total: 3.64 ms\n", "Wall time: 2.47 ms\n" ] } ], "source": [ "# CPU input/output arrays, implied synchronization for memory copies\n", "%time add_kernel[blocks_per_grid, threads_per_block](x, y, out)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 467 µs, sys: 64 µs, total: 531 µs\n", "Wall time: 489 µs\n" ] } ], "source": [ "# GPU input/output arrays, no synchronization (but force sync before and after)\n", "cuda.synchronize()\n", "%time add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device)\n", "cuda.synchronize()" ] }, { "cell_type": "code", "execution_count": 31, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "CPU times: user 1.27 ms, sys: 546 µs, total: 1.82 ms\n", "Wall time: 1.01 ms\n" ] } ], "source": [ "# GPU input/output arrays, include explicit synchronization in timing\n", "cuda.synchronize()\n", "%time add_kernel[blocks_per_grid, threads_per_block](x_device, y_device, out_device); cuda.synchronize()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "**Always be sure to synchronize with the GPU when benchmarking CUDA 
kernels!**" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Atomic Operations and Avoiding Race Conditions\n", "\n", "CUDA, like many general purpose parallel execution frameworks, makes it possible to have race conditions in your code. A race condition in CUDA arises when threads read or write a memory location that might be modified by another independent thread. Generally speaking, you need to worry about:\n", "\n", " * read-after-write hazards: One thread is reading a memory location at the same time another thread might be writing to it.\n", " * write-after-write hazards: Two threads are writing to the same memory location, and only one write will be visible when the kernel is complete.\n", " \n", "A common strategy to avoid both of these hazards is to organize your CUDA kernel algorithm such that each thread has exclusive responsibility for unique subsets of output array elements, and/or to never use the same array for both input and output in a single kernel call. (Iterative algorithms can use a double-buffering strategy if needed, and switch input and output arrays on each iteration.)\n", "\n", "However, there are many cases where different threads need to combine results. Consider something very simple, like: \"every thread increments a global counter.\" Implementing this in your kernel requires each thread to:\n", "\n", "1. Read the current value of a global counter.\n", "2. Compute `counter + 1`.\n", "3. Write that value back to global memory.\n", "\n", "However, there is no guarantee that another thread has not changed the global counter between steps 1 and 3. To resolve this problem, CUDA provides \"atomic operations\" which will read, modify and update a memory location in one indivisible step. 
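The hazard in steps 1-3 can be demonstrated with a plain-Python sketch (no GPU involved) by interleaving the reads and writes of two hypothetical threads:

```python
# Both "threads" perform step 1 (read) before either performs step 3 (write),
# so the second write overwrites the first and one increment is lost.
counter = 0

thread1_read = counter        # thread 1, step 1: read the current value
thread2_read = counter        # thread 2, step 1: reads the same stale value

counter = thread1_read + 1    # thread 1, steps 2-3: compute and write back
counter = thread2_read + 1    # thread 2, steps 2-3: overwrites thread 1's update

assert counter == 1  # two increments ran, but the counter only advanced once
```

An atomic add collapses the read-modify-write sequence into a single indivisible operation, so this interleaving cannot happen.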
Numba supports several of these functions, [described here](http://numba.pydata.org/numba-doc/dev/cuda/intrinsics.html#supported-atomic-operations).\n", "\n", "Let's make our thread counter kernel:" ] }, { "cell_type": "code", "execution_count": 38, "metadata": { "collapsed": true }, "outputs": [], "source": [ "@cuda.jit\n", "def thread_counter_race_condition(global_counter):\n", " global_counter[0] += 1 # This is bad\n", " \n", "@cuda.jit\n", "def thread_counter_safe(global_counter):\n", " cuda.atomic.add(global_counter, 0, 1) # Safely add 1 to offset 0 in global_counter array" ] }, { "cell_type": "code", "execution_count": 42, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Should be 4096: [3]\n" ] } ], "source": [ "# This gets the wrong answer\n", "global_counter = cuda.to_device(np.array([0], dtype=np.int32))\n", "thread_counter_race_condition[64, 64](global_counter)\n", "\n", "print('Should be %d:' % (64*64), global_counter.copy_to_host())" ] }, { "cell_type": "code", "execution_count": 43, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Should be 4096: [4096]\n" ] } ], "source": [ "# This works correctly\n", "global_counter = cuda.to_device(np.array([0], dtype=np.int32))\n", "thread_counter_safe[64, 64](global_counter)\n", "\n", "print('Should be %d:' % (64*64), global_counter.copy_to_host())" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## Exercise\n", "\n", "For this exercise, create a histogramming kernel. This will take an array of input data, a range and a number of bins, and count how many of the input data elements land in each bin. 
Below is an example CPU implementation of histogramming:" ] }, { "cell_type": "code", "execution_count": 50, "metadata": {}, "outputs": [], "source": [ "def cpu_histogram(x, xmin, xmax, histogram_out):\n", " '''Increment bin counts in histogram_out, given histogram range [xmin, xmax).'''\n", " # Note that we don't have to pass in nbins explicitly, because the size of histogram_out determines it\n", " nbins = histogram_out.shape[0]\n", " bin_width = (xmax - xmin) / nbins\n", " \n", " # This is a very slow way to do this with NumPy, but looks similar to what you will do on the GPU\n", " for element in x:\n", " bin_number = np.int32((element - xmin)/bin_width)\n", " if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n", " # only increment if in range\n", " histogram_out[bin_number] += 1" ] }, { "cell_type": "code", "execution_count": 56, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 5, 86, 494, 1562, 2891, 2879, 1527, 462, 88, 5], dtype=int32)" ] }, "execution_count": 56, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = np.random.normal(size=10000, loc=0, scale=1).astype(np.float32)\n", "xmin = np.float32(-4.0)\n", "xmax = np.float32(4.0)\n", "histogram_out = np.zeros(shape=10, dtype=np.int32)\n", "\n", "cpu_histogram(x, xmin, xmax, histogram_out)\n", "\n", "histogram_out" ] }, { "cell_type": "code", "execution_count": 57, "metadata": { "collapsed": true }, "outputs": [], "source": [ "@cuda.jit\n", "def cuda_histogram(x, xmin, xmax, histogram_out):\n", " '''Increment bin counts in histogram_out, given histogram range [xmin, xmax).'''\n", " \n", " pass # Replace this with your implementation" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": 
"text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: 5 - Troubleshooting and Debugging.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GTC 2017 Numba Tutorial Notebook 5: Troubleshooting and Debugging\n", "\n", "## Note about the Terminal\n", "\n", "Debugging is an important part of programming. Unfortunately, it is pretty difficult to debug CUDA kernels directly in the Jupyter notebook for a variety of reasons, so this notebook will show terminal commands by executing Jupyter notebook cells using the shell. These shell commands will appear in notebook cells with the command line prefixed by `!`. When applying the debug methods described in this notebook, you will likely run the commands in the terminal directly.\n", "\n", "## Printing\n", "\n", "A common debugging strategy is printing to the console. Numba supports printing from CUDA kernels, with some restrictions. 
Note that output printed from a CUDA kernel will not be captured by Jupyter, so you will need to debug with a script you can run from the terminal.\n", "\n", "Let's look at a CUDA kernel with a bug:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "import numpy as np\r\n", "\r\n", "from numba import cuda\r\n", "\r\n", "@cuda.jit\r\n", "def histogram(x, xmin, xmax, histogram_out):\r\n", " nbins = histogram_out.shape[0]\r\n", " bin_width = (xmax - xmin) / nbins\r\n", "\r\n", " start = cuda.grid(1)\r\n", " stride = cuda.gridsize(1)\r\n", "\r\n", " for i in range(start, x.shape[0], stride):\r\n", " bin_number = np.int32((x[i] - xmin)/bin_width)\r\n", " if bin_number >= 0 and bin_number < histogram_out.shape[0]:\r\n", " histogram_out[bin_number] += 1\r\n", "\r\n", "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n", "xmin = np.float32(-4.0)\r\n", "xmax = np.float32(4.0)\r\n", "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n", "\r\n", "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n", "\r\n", "print('input count:', x.shape[0])\r\n", "print('histogram:', histogram_out)\r\n", "print('count:', histogram_out.sum())\r\n" ] } ], "source": [ "! cat debug/ex1.py" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "When we run this code to histogram 50 values, we see the histogram is not getting 50 entries: " ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "input count: 50\r\n", "histogram: [0 0 1 1 1 1 1 1 0 0]\r\n", "count: 6\r\n" ] } ], "source": [ "! python debug/ex1.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*(You might have already spotted the mistake, but let's pretend we don't know the answer.)*\n", "\n", "We hypothesize that maybe a bin calculation error is causing many of the histogram entries to appear out of range. 
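One quick way to test that hypothesis without touching the GPU is to replay the kernel's bin arithmetic on the host with plain NumPy (a sketch reusing the constants from `debug/ex1.py`):

```python
import numpy as np

# Recompute the kernel's bin numbers on the host, using the same constants
# as debug/ex1.py, to see whether values could actually fall out of range.
xmin = np.float32(-4.0)
xmax = np.float32(4.0)
nbins = 10
bin_width = (xmax - xmin) / nbins

x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)
bin_numbers = ((x - xmin) / bin_width).astype(np.int32)

in_range = int(np.count_nonzero((bin_numbers >= 0) & (bin_numbers < nbins)))
print('in range:', in_range, 'of', x.shape[0])
```

For standard-normal data essentially every value lies inside [-4, 4), so if this host-side count comes back (nearly) complete, the bin arithmetic is probably not the culprit and the investigation should move elsewhere.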
Let's add some printing around the `if` statement to show us what is going on:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "import numpy as np\r\n", "\r\n", "from numba import cuda\r\n", "\r\n", "@cuda.jit\r\n", "def histogram(x, xmin, xmax, histogram_out):\r\n", " nbins = histogram_out.shape[0]\r\n", " bin_width = (xmax - xmin) / nbins\r\n", "\r\n", " start = cuda.grid(1)\r\n", " stride = cuda.gridsize(1)\r\n", "\r\n", " for i in range(start, x.shape[0], stride):\r\n", " bin_number = np.int32((x[i] - xmin)/bin_width)\r\n", " if bin_number >= 0 and bin_number < histogram_out.shape[0]:\r\n", " histogram_out[bin_number] += 1\r\n", " print('in range', x[i], bin_number)\r\n", " else:\r\n", " print('out of range', x[i], bin_number)\r\n", "\r\n", "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n", "xmin = np.float32(-4.0)\r\n", "xmax = np.float32(4.0)\r\n", "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n", "\r\n", "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n", "\r\n", "print('input count:', x.shape[0])\r\n", "print('histogram:', histogram_out)\r\n", "print('count:', histogram_out.sum())\r\n" ] } ], "source": [ "! cat debug/ex1a.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This kernel will print every value and bin number it calculates. Looking at one of the print statements, we see that `print` supports constant strings, and scalar values:\n", "\n", "``` python\n", "print('in range', x[i], bin_number)\n", "```\n", "\n", "String substitution (using C printf syntax or the newer `format()` syntax) is not supported. 
If we run this script we see:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "in range 1.674757 7\r\n", "in range -0.492113 4\r\n", "in range 0.526627 5\r\n", "in range 2.359267 7\r\n", "in range -1.768394 2\r\n", "in range 0.342256 5\r\n", "in range 0.793954 5\r\n", "in range -0.338127 4\r\n", "in range 1.275327 6\r\n", "in range -0.877891 3\r\n", "in range 0.922818 6\r\n", "in range 0.635215 5\r\n", "in range 0.371592 5\r\n", "in range 0.925639 6\r\n", "in range -1.116025 3\r\n", "in range 0.615792 5\r\n", "in range 0.879030 6\r\n", "in range 2.061845 7\r\n", "in range 0.037717 5\r\n", "in range -0.440858 4\r\n", "in range 1.056680 6\r\n", "in range -0.111198 4\r\n", "in range 0.452880 5\r\n", "in range -0.154099 4\r\n", "in range 0.518296 5\r\n", "in range 0.072946 5\r\n", "in range 1.209770 6\r\n", "in range -0.057651 4\r\n", "in range 0.154896 5\r\n", "in range 1.099341 6\r\n", "in range 0.271862 5\r\n", "in range 0.643499 5\r\n", "in range 0.824574 6\r\n", "in range 0.809260 6\r\n", "in range 0.354412 5\r\n", "in range -0.365111 4\r\n", "in range 0.594393 5\r\n", "in range 0.830470 6\r\n", "in range -0.402743 4\r\n", "in range -0.554546 4\r\n", "in range -0.507898 4\r\n", "in range -0.006359 4\r\n", "in range -0.316683 4\r\n", "in range 2.015556 7\r\n", "in range -1.288521 3\r\n", "in range 0.401858 5\r\n", "in range -1.410364 3\r\n", "in range -0.459540 4\r\n", "in range 0.633090 5\r\n", "in range -1.000390 3\r\n", "input count: 50\r\n", "histogram: [0 0 1 2 2 2 2 2 0 0]\r\n", "count: 11\r\n" ] } ], "source": [ "! python debug/ex1a.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Scanning down that output, we see that all 50 values should be in range. Clearly we have some kind of race condition updating the histogram. 
In fact, the culprit line is:\n", "\n", "``` python\n", "histogram_out[bin_number] += 1\n", "```\n", "\n", "which should be (as you may have seen in a previous exercise)\n", "\n", "``` python\n", "cuda.atomic.add(histogram_out, bin_number, 1)\n", "```" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## CUDA Simulator\n", "\n", "Back in the early days of CUDA, `nvcc` had an \"emulator\" mode that would execute CUDA code on the CPU. That functionality was dropped in later CUDA releases after `cuda-gdb` was created. We missed emulator mode so much that Numba includes a \"CUDA simulator\" that runs your CUDA code with the Python interpreter on the host CPU. This allows you to debug the logic of your code using Python modules and functions that would otherwise not be allowed by the compiler.\n", "\n", "A very common use case is to start the Python debugger inside one thread of a CUDA kernel:\n", "``` python\n", "import numpy as np\n", "\n", "from numba import cuda\n", "\n", "@cuda.jit\n", "def histogram(x, xmin, xmax, histogram_out):\n", " nbins = histogram_out.shape[0]\n", " bin_width = (xmax - xmin) / nbins\n", "\n", " start = cuda.grid(1)\n", " stride = cuda.gridsize(1)\n", "\n", " ### DEBUG FIRST THREAD\n", " if start == 0:\n", " from pdb import set_trace; set_trace()\n", " ###\n", "\n", " for i in range(start, x.shape[0], stride):\n", " bin_number = np.int32((x[i] + xmin)/bin_width)\n", "\n", " if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n", " cuda.atomic.add(histogram_out, bin_number, 1)\n", "\n", "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\n", "xmin = np.float32(-4.0)\n", "xmax = np.float32(4.0)\n", "histogram_out = np.zeros(shape=10, dtype=np.int32)\n", "\n", "histogram[64, 64](x, xmin, xmax, histogram_out)\n", "\n", "print('input count:', x.shape[0])\n", "print('histogram:', histogram_out)\n", "print('count:', histogram_out.sum())\n", "```\n", "\n", "This code allows a debug session like 
the following to take place:\n", "```\n", "(gtc2017) 0179-sseibert:gtc2017-numba sseibert$ NUMBA_ENABLE_CUDASIM=1 python debug/ex2.py\n", "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(18)histogram()\n", "-> for i in range(start, x.shape[0], stride):\n", "(Pdb) n\n", "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(19)histogram()\n", "-> bin_number = np.int32((x[i] + xmin)/bin_width)\n", "(Pdb) n\n", "> /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex2.py(21)histogram()\n", "-> if bin_number >= 0 and bin_number < histogram_out.shape[0]:\n", "(Pdb) p bin_number, x[i]\n", "(-6, -1.4435024)\n", "(Pdb) p x[i], xmin, bin_width\n", "(-1.4435024, -4.0, 0.80000000000000004)\n", "(Pdb) p (x[i] - xmin) / bin_width\n", "3.1956219673156738\n", "(Pdb) q\n", "```" ] }, { "cell_type": "markdown", "metadata": { "collapsed": true }, "source": [ "## CUDA Memcheck\n", "\n", "Another common error occurs when a CUDA kernel has an invalid memory access, typically caused by running off the end of an array. 
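Note how easily this class of bug can hide on the GPU: host-side NumPy bounds-checks every index and raises an `IndexError` for an out-of-range write, whereas a CUDA kernel performs no such check and the write silently lands out of bounds. A minimal host-side sketch of the contrast (hypothetical values, not part of the exercise):

```python
import numpy as np

# A 10-bin histogram, as in the kernels above.
histogram_out = np.zeros(10, dtype=np.int32)
bin_number = 12  # out of range for a 10-bin histogram

# On the host, NumPy catches the mistake immediately. The same
# write inside a CUDA kernel would silently corrupt memory --
# which is exactly what cuda-memcheck exists to detect.
try:
    histogram_out[bin_number] += 1
except IndexError as err:
    print('caught on the host:', err)

# Beware: a *negative* bin_number such as -6 would wrap around on
# the host (NumPy negative indexing) rather than raise, so testing
# on the host or in the simulator is not a complete bounds check.
```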
The full CUDA toolkit from NVIDIA (not the `cudatoolkit` conda package) contains a utility called `cuda-memcheck` that can check for a wide range of memory access mistakes in CUDA code.\n", "\n", "Let's debug the following code:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "import numpy as np\r\n", "\r\n", "from numba import cuda\r\n", "\r\n", "@cuda.jit\r\n", "def histogram(x, xmin, xmax, histogram_out):\r\n", " nbins = histogram_out.shape[0]\r\n", " bin_width = (xmax - xmin) / nbins\r\n", "\r\n", " start = cuda.grid(1)\r\n", " stride = cuda.gridsize(1)\r\n", "\r\n", " for i in range(start, x.shape[0], stride):\r\n", " bin_number = np.int32((x[i] + xmin)/bin_width)\r\n", "\r\n", " if bin_number >= 0 or bin_number < histogram_out.shape[0]:\r\n", " cuda.atomic.add(histogram_out, bin_number, 1)\r\n", "\r\n", "x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n", "xmin = np.float32(-4.0)\r\n", "xmax = np.float32(4.0)\r\n", "histogram_out = np.zeros(shape=10, dtype=np.int32)\r\n", "\r\n", "histogram[64, 64](x, xmin, xmax, histogram_out)\r\n", "\r\n", "print('input count:', x.shape[0])\r\n", "print('histogram:', histogram_out)\r\n", "print('count:', histogram_out.sum())\r\n" ] } ], "source": [ "! 
cat debug/ex3.py" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "========= CUDA-MEMCHECK\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (49,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (48,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (47,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (46,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (45,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (44,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (43,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (42,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (41,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (40,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (39,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (38,0,0) in block (0,0,0)\n", "========= Address 0x700a601e4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (37,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (36,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (35,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (34,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (33,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (32,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (31,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (30,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (29,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (28,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (27,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (26,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (25,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (24,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (23,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (22,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (21,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (20,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (19,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (18,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (17,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (16,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (15,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (14,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (13,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (12,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (11,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (10,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (9,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (8,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (7,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (6,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (5,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (4,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (3,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (2,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (1,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (0,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to \"unspecified launch failure\" on CUDA API call to cuMemcpyDtoH_v2. 
\n", "========= Saved host backtrace up to driver entry point at error\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuMemcpyDtoH_v2 + 0x184) [0x158214]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "========= Host Frame:[0x7fff5b7db780]\n", "=========\n", "Traceback (most recent call last):\n", " File \"debug/ex3.py\", line 24, in \n", " histogram[64, 64](x, xmin, xmax, histogram_out)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 703, in __call__\n", " cfg(*args)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 483, in __call__\n", " sharedmem=self.sharedmem)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 585, in _kernel_call\n", " wb()\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 600, in \n", " retr.append(lambda: devary.copy_to_host(val, stream=stream))\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/devicearray.py\", line 198, in copy_to_host\n", " _driver.device_to_host(hostary, self, self.alloc_size, stream=stream)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 1484, in device_to_host\n", " fn(host_pointer(dst), device_pointer(src), size, *varargs)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 262, in safe_cuda_api_call\n", " self._check_error(fname, retcode)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 299, in _check_error\n", " raise CudaAPIError(retcode, msg)\n", 
"numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "========= ERROR SUMMARY: 51 errors\r\n" ] } ], "source": [ "! cuda-memcheck python debug/ex3.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `cuda-memcheck` output clearly shows a problem with our histogram function:\n", "```\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000460 in cudapy::__main__::histogram$241(Array, float, float, Array)\n", "```\n", "But it doesn't tell us which line of the Python source triggered it. To get better error information, we can turn on \"debug\" mode when compiling the kernel by changing the decorator, like this:\n", "``` python\n", "@cuda.jit(debug=True)\n", "def histogram(x, xmin, xmax, histogram_out):\n", " nbins = histogram_out.shape[0]\n", "```" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "========= CUDA-MEMCHECK\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (49,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread 
(48,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (47,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (46,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 
in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (45,0,0) in block (0,0,0)\n", "========= Address 0x700a601e4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (44,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (43,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (42,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (41,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (40,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= 
Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (39,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (38,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (37,0,0) in block 
(0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (36,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (35,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in 
/Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (34,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (33,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (32,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (31,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (30,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (29,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= 
Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (28,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (27,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (26,0,0) in block 
(0,0,0)\n", "========= Address 0x700a601e4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (25,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (24,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in 
/Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (23,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (22,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (21,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host 
Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (20,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (19,0,0) in block (0,0,0)\n", "========= Address 0x700a601f4 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (18,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= 
Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (17,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (16,0,0) in block (0,0,0)\n", "========= Address 0x700a601f8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (15,0,0) in block 
(0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (14,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (13,0,0) in block (0,0,0)\n", "========= Address 0x700a601ec is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in 
/Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (1,0,0) in block (0,0,0)\n", "========= Address 0x700a601f0 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Invalid __global__ write of size 4\n", "========= at 0x00000ba8 in /Users/sseibert/continuum/conferences/gtc2017-numba/debug/ex3a.py:17:cudapy::__main__::histogram$241(Array, float, float, Array)\n", "========= by thread (0,0,0) in block (0,0,0)\n", "========= Address 0x700a601e8 is out of bounds\n", "========= Saved host backtrace up to driver entry point at kernel launch time\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuLaunchKernel + 0x234) [0x15ca94]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "=========\n", "========= Program hit CUDA_ERROR_LAUNCH_FAILED (error 719) due to \"unspecified launch failure\" on CUDA API call to cuMemcpyDtoH_v2. 
\n", "========= Saved host backtrace up to driver entry point at error\n", "========= Host Frame:/Library/Frameworks/CUDA.framework/Versions/A/Libraries/libcuda_355.10.05.15_mercury.dylib (cuMemcpyDtoH_v2 + 0x184) [0x158214]\n", "========= Host Frame:/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/lib-dynload/_ctypes.cpython-36m-darwin.so (ffi_call_unix64 + 0x4f) [0xd2b7]\n", "========= Host Frame:[0x7fff50092d60]\n", "=========\n", "Traceback (most recent call last):\n", " File \"debug/ex3a.py\", line 24, in \n", " histogram[64, 64](x, xmin, xmax, histogram_out)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 703, in __call__\n", " cfg(*args)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 483, in __call__\n", " sharedmem=self.sharedmem)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/compiler.py\", line 560, in _kernel_call\n", " driver.device_to_host(ctypes.addressof(excval), excmem, excsz)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 1484, in device_to_host\n", " fn(host_pointer(dst), device_pointer(src), size, *varargs)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 262, in safe_cuda_api_call\n", " self._check_error(fname, retcode)\n", " File \"/Users/sseibert/anaconda/envs/gtc2017/lib/python3.6/site-packages/numba/cuda/cudadrv/driver.py\", line 299, in _check_error\n", " raise CudaAPIError(retcode, msg)\n", "numba.cuda.cudadrv.driver.CudaAPIError: [719] Call to cuMemcpyDtoH results in UNKNOWN_CUDA_ERROR\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "========= ERROR SUMMARY: 51 errors\r\n" ] } ], "source": [ "! 
cuda-memcheck python debug/ex3a.py" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we get an error message that includes a source file and line number: `ex3a.py:17`." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " 15\t\r\n", " 16\t if bin_number >= 0 or bin_number < histogram_out.shape[0]:\r\n", " 17\t cuda.atomic.add(histogram_out, bin_number, 1)\r\n", " 18\t\r\n", " 19\tx = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)\r\n" ] } ], "source": [ "! cat -n debug/ex3a.py | grep -C 2 \"17\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "At this point, we might realize that our if statement incorrectly has an `or` instead of an `and`.\n", "\n", "`cuda-memcheck` has different modes for detecting different kinds of problems (similar to `valgrind` for debugging CPU memory access errors). Take a look at the documentation for more information: http://docs.nvidia.com/cuda/cuda-memcheck/" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 } ================================================ FILE: 6 - Extra Topics.ipynb ================================================ { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# GTC 2017 Numba Tutorial Notebook 6: Extra Topics\n", "\n", "## Random Numbers\n", "\n", "GPUs can be extremely useful for Monte Carlo applications where you need to use large amounts of random numbers. 
CUDA ships with an excellent set of random number generation algorithms in the cuRAND library. Unfortunately, cuRAND is defined in a set of C headers which Numba can't easily compile or link to. (Numba's CUDA JIT does not ever create C code for CUDA kernels.) It is on the Numba roadmap to find a solution to this problem, but it may take some time.\n", "\n", "In the meantime, Numba version 0.33 and later includes the `xoroshiro128+` generator, which is pretty high quality, though with a smaller period ($2^{128} - 1$) than the XORWOW generator in cuRAND. To use it, you will want to initialize the RNG state on the host for each thread in your kernel:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "from numba import cuda\n", "from numba.cuda.random import create_xoroshiro128p_states, xoroshiro128p_uniform_float32\n", "\n", "threads_per_block = 64\n", "blocks = 24\n", "rng_states = create_xoroshiro128p_states(threads_per_block * blocks, seed=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This state creation function initializes each state to be in the same sequence designated by the seed, but separated by $2^{64}$ steps from each other. 
This ensures that different threads will not accidentally end up with overlapping sequences (since you will not have enough patience to draw anywhere near $2^{64}$ random numbers in a single thread).\n", "\n", "We can use these random number states in our kernel by passing them in as an argument:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": true }, "outputs": [], "source": [ "@cuda.jit\n", "def compute_pi(rng_states, iterations, out):\n", " \"\"\"Estimate pi by Monte Carlo and store each thread's estimate in out[thread_id]\"\"\"\n", " thread_id = cuda.grid(1)\n", "\n", " # Compute pi by drawing random (x, y) points and finding what\n", " # fraction lie inside a unit circle\n", " inside = 0\n", " for i in range(iterations):\n", " x = xoroshiro128p_uniform_float32(rng_states, thread_id)\n", " y = xoroshiro128p_uniform_float32(rng_states, thread_id)\n", " if x**2 + y**2 <= 1.0:\n", " inside += 1\n", "\n", " out[thread_id] = 4.0 * inside / iterations" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "pi: 3.14167\n" ] } ], "source": [ "out = np.zeros(threads_per_block * blocks, dtype=np.float32)\n", "compute_pi[blocks, threads_per_block](rng_states, 10000, out)\n", "print('pi:', out.mean())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Shared Memory\n", "\n", "We briefly mention in notebook #4 that the CUDA programming model organizes threads into a two-layer structure. A grid is composed of many blocks, which are composed of many threads. Threads within the same block can communicate much more easily than threads in different blocks. The main mechanism for this communication is *shared memory*. Shared memory is discussed extensively in the CUDA C Programming Guide, as well as many other books on CUDA programming. 
We will only describe it very briefly here, and focus mainly on the Python syntax for using it.\n", "\n", "Shared memory is a section of memory that is visible at the block level. Different blocks cannot see each other's shared memory, and all the threads within a block see the same shared memory. It does not persist after a CUDA kernel finishes executing. Shared memory is a scarce hardware resource, so it should be used sparingly; overusing it can reduce performance, and exceeding the hardware limit of 48 kB per block will cause the kernel launch to fail.\n", "\n", "Shared memory is good for several things:\n", " * caching of lookup tables that will be randomly accessed\n", " * buffering output from threads so it can be coalesced before writing it back to device memory\n", " * staging data for scatter/gather operations within a block\n", " \n", "As an example of the power of shared memory, let's write a transpose kernel that takes a 2D array in row-major order and puts it in column-major order. 
(This is based on Mark Harris' blog post at: https://devblogs.nvidia.com/parallelforall/efficient-matrix-transpose-cuda-cc/)\n", "\n", "First, let's do the naive approach where we let each thread read and write individual elements independently:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "TILE_DIM = 32\n", "BLOCK_ROWS = 8\n", "\n", "@cuda.jit\n", "def transpose(a_in, a_out):\n", " x = cuda.blockIdx.x * TILE_DIM + cuda.threadIdx.x\n", " y = cuda.blockIdx.y * TILE_DIM + cuda.threadIdx.y\n", "\n", " for j in range(0, TILE_DIM, BLOCK_ROWS):\n", " a_out[x, y + j] = a_in[y + j, x]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0 1 2 ..., 1021 1022 1023]\n", " [ 1024 1025 1026 ..., 2045 2046 2047]\n", " [ 2048 2049 2050 ..., 3069 3070 3071]\n", " ..., \n", " [1045504 1045505 1045506 ..., 1046525 1046526 1046527]\n", " [1046528 1046529 1046530 ..., 1047549 1047550 1047551]\n", " [1047552 1047553 1047554 ..., 1048573 1048574 1048575]]\n" ] } ], "source": [ "size = 1024\n", "a_in = cuda.to_device(np.arange(size*size, dtype=np.int32).reshape((size, size)))\n", "a_out = cuda.device_array_like(a_in)\n", "\n", "print(a_in.copy_to_host())" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 126.03 times longer than the fastest. 
This could mean that an intermediate result is being cached.\n", "1000 loops, best of 3: 872 µs per loop\n", "[[ 0 1024 2048 ..., 1045504 1046528 1047552]\n", " [ 1 1025 2049 ..., 1045505 1046529 1047553]\n", " [ 2 1026 2050 ..., 1045506 1046530 1047554]\n", " ..., \n", " [ 1021 2045 3069 ..., 1046525 1047549 1048573]\n", " [ 1022 2046 3070 ..., 1046526 1047550 1048574]\n", " [ 1023 2047 3071 ..., 1046527 1047551 1048575]]\n" ] } ], "source": [ "grid_shape = (int(size/TILE_DIM), int(size/TILE_DIM))\n", "%timeit transpose[grid_shape,(TILE_DIM, BLOCK_ROWS)](a_in, a_out); cuda.synchronize()\n", "print(a_out.copy_to_host())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's use shared memory to copy a 32x32 tile at a time. We'll use a global value for the tile size so it will be known at compile time:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": true }, "outputs": [], "source": [ "import numba.types\n", "\n", "TILE_DIM_PADDED = TILE_DIM + 1 # Read Mark Harris' blog post to find out why this improves performance!\n", "\n", "@cuda.jit\n", "def tile_transpose(a_in, a_out):\n", " # THIS CODE ASSUMES IT IS RUNNING WITH A BLOCK DIMENSION OF (TILE_SIZE x TILE_SIZE)\n", " # AND INPUT IS A MULTIPLE OF TILE_SIZE DIMENSIONS\n", " tile = cuda.shared.array((TILE_DIM, TILE_DIM_PADDED), numba.types.int32)\n", "\n", " x = cuda.blockIdx.x * TILE_DIM + cuda.threadIdx.x\n", " y = cuda.blockIdx.y * TILE_DIM + cuda.threadIdx.y\n", " \n", " for j in range(0, TILE_DIM, BLOCK_ROWS):\n", " tile[cuda.threadIdx.y + j, cuda.threadIdx.x] = a_in[y + j, x] # transpose tile into shared memory\n", "\n", " cuda.syncthreads() # wait for all threads in the block to finish updating shared memory\n", "\n", " # Compute transposed offsets\n", " x = cuda.blockIdx.y * TILE_DIM + cuda.threadIdx.x\n", " y = cuda.blockIdx.x * TILE_DIM + cuda.threadIdx.y\n", "\n", " for j in range(0, TILE_DIM, BLOCK_ROWS):\n", " a_out[y + j, x] = tile[cuda.threadIdx.x, 
cuda.threadIdx.y + j]\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slowest run took 224.15 times longer than the fastest. This could mean that an intermediate result is being cached.\n", "1000 loops, best of 3: 615 µs per loop\n", "[[ 0 1024 2048 ..., 1045504 1046528 1047552]\n", " [ 1 1025 2049 ..., 1045505 1046529 1047553]\n", " [ 2 1026 2050 ..., 1045506 1046530 1047554]\n", " ..., \n", " [ 1021 2045 3069 ..., 1046525 1047549 1048573]\n", " [ 1022 2046 3070 ..., 1046526 1047550 1048574]\n", " [ 1023 2047 3071 ..., 1046527 1047551 1048575]]\n" ] } ], "source": [ "a_out = cuda.device_array_like(a_in) # replace with new array\n", "\n", "%timeit tile_transpose[grid_shape,(TILE_DIM, BLOCK_ROWS)](a_in, a_out); cuda.synchronize()\n", "print(a_out.copy_to_host())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "That's a 30% speed up!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Generalized Ufuncs\n", "\n", "Ufuncs broadcast a scalar function over array inputs, but what if you want to broadcast a lower dimensional array function over a higher dimensional array? This is called a *generalized ufunc* (\"gufunc\"), and it opens up a whole new frontier for applying ufuncs.\n", "\n", "Generalized ufuncs are a little more tricky because they need a *signature* (not to be confused with the Numba type signature) that shows the index ordering when dealing with multiple inputs. 
Fully explaining \"gufunc\" signatures is beyond the scope of this tutorial, but you can learn more from:\n", "\n", "* The NumPy docs on gufuncs: https://docs.scipy.org/doc/numpy/reference/c-api.generalized-ufuncs.html\n", "* The Numba docs on gufuncs: http://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator\n", "* The Numba docs on CUDA gufuncs: http://numba.pydata.org/numba-doc/latest/cuda/ufunc.html#generalized-cuda-ufuncs\n", "\n", "Let's write our own normalization function. This will take an array input and compute the L2 norm along the last dimension. Generalized ufuncs take their output array as the last argument, rather than returning a value. If the output is a scalar (as it will be for the norm function), then we will still receive a 1D array as the output argument, but we will write the scalar output to element 0 of the array." ] }, { "cell_type": "code", "execution_count": 15, "metadata": { "collapsed": true }, "outputs": [], "source": [ "from numba import guvectorize\n", "import math\n", "\n", "@guvectorize(['(float32[:], float32[:])'], # have to include the output array in the type signature\n", " '(i)->()', # map a 1D array to a scalar output\n", " target='cuda')\n", "def l2_norm(vec, out):\n", " acc = 0.0\n", " for value in vec:\n", " acc += value**2\n", " out[0] = math.sqrt(acc)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To test this, let's construct some points on the unit circle:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[[ 0.08839864 0.99608518]\n", " [-0.24687871 0.96904639]\n", " [ 0.05012312 0.99874305]\n", " [ 0.55552226 -0.83150167]\n", " [-0.19207358 0.98138053]\n", " [-0.93777456 0.34724468]\n", " [-0.99945911 -0.03288583]\n", " [ 0.67578176 0.73710176]\n", " [-0.16344104 0.9865531 ]\n", " [ 0.99877497 -0.04948285]]\n" ] } ], "source": [ "angles = np.random.uniform(-np.pi, np.pi, 10)\n", "coords 
= np.stack([np.cos(angles), np.sin(angles)], axis=1)\n", "print(coords)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, the L2 norm is 1.0, up to rounding errors:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1. , 1. , 1. , 1. , 1. ,\n", " 1. , 1. , 1. , 1. , 0.99999994], dtype=float32)" ] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [ "l2_norm(coords)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 2 }

================================================ FILE: README.md ================================================

# gtc2017-numba
Numba tutorial for GTC 2017 conference

================================================ FILE: debug/ex1.py ================================================

import numpy as np
from numba import cuda

@cuda.jit
def histogram(x, xmin, xmax, histogram_out):
    nbins = histogram_out.shape[0]
    bin_width = (xmax - xmin) / nbins

    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for i in range(start, x.shape[0], stride):
        bin_number = np.int32((x[i] - xmin)/bin_width)
        if bin_number >= 0 and bin_number < histogram_out.shape[0]:
            histogram_out[bin_number] += 1

x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)
xmin = np.float32(-4.0)
xmax = np.float32(4.0)
histogram_out = np.zeros(shape=10, dtype=np.int32)

histogram[64, 64](x, xmin, xmax, histogram_out)

print('input count:', x.shape[0])
print('histogram:', histogram_out)
print('count:', histogram_out.sum())
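The binning rule shared by all the `debug/` scripts can be checked on the CPU without a GPU. The sketch below is reference code added for this write-up (not a file in the repository); it mirrors the kernel's `int((x - xmin) / bin_width)` computation and its bounds check, and compares the result against `np.histogram`:

```python
import numpy as np

def histogram_reference(x, xmin, xmax, nbins):
    """CPU reference for the kernels' binning rule: bin = int((x - xmin) / bin_width)."""
    bin_width = (xmax - xmin) / nbins
    out = np.zeros(nbins, dtype=np.int32)
    for value in x:
        bin_number = int((value - xmin) / bin_width)
        # Note the `and`: ex3.py's bug is using `or` here, which lets
        # out-of-range bin numbers through and corrupts memory on the GPU.
        if bin_number >= 0 and bin_number < nbins:
            out[bin_number] += 1
    return out

# Deterministic stand-in for the random input, all values inside (-4, 4)
x = np.linspace(-3.5, 3.5, 50, dtype=np.float32)
ref = histogram_reference(x, np.float32(-4.0), np.float32(4.0), 10)
expected, _ = np.histogram(x, bins=10, range=(-4.0, 4.0))
assert ref.sum() == 50
assert (ref == expected).all()
print('histogram:', ref)
```

Out-of-range values are silently dropped by the bounds check, which is why the notebook compares `histogram_out.sum()` with the input count.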
================================================ FILE: debug/ex1a.py ================================================

import numpy as np
from numba import cuda

@cuda.jit
def histogram(x, xmin, xmax, histogram_out):
    nbins = histogram_out.shape[0]
    bin_width = (xmax - xmin) / nbins

    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for i in range(start, x.shape[0], stride):
        bin_number = np.int32((x[i] - xmin)/bin_width)
        if bin_number >= 0 and bin_number < histogram_out.shape[0]:
            histogram_out[bin_number] += 1
            print('in range', x[i], bin_number)
        else:
            print('out of range', x[i], bin_number)

x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)
xmin = np.float32(-4.0)
xmax = np.float32(4.0)
histogram_out = np.zeros(shape=10, dtype=np.int32)

histogram[64, 64](x, xmin, xmax, histogram_out)

print('input count:', x.shape[0])
print('histogram:', histogram_out)
print('count:', histogram_out.sum())

================================================ FILE: debug/ex2.py ================================================

import numpy as np
from numba import cuda

@cuda.jit
def histogram(x, xmin, xmax, histogram_out):
    nbins = histogram_out.shape[0]
    bin_width = (xmax - xmin) / nbins

    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    ### DEBUG FIRST THREAD
    if start == 0:
        from pdb import set_trace; set_trace()
    ###

    for i in range(start, x.shape[0], stride):
        bin_number = np.int32((x[i] + xmin)/bin_width)
        if bin_number >= 0 and bin_number < histogram_out.shape[0]:
            cuda.atomic.add(histogram_out, bin_number, 1)

x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)
xmin = np.float32(-4.0)
xmax = np.float32(4.0)
histogram_out = np.zeros(shape=10, dtype=np.int32)

histogram[64, 64](x, xmin, xmax, histogram_out)

print('input count:', x.shape[0])
print('histogram:', histogram_out)
print('count:', histogram_out.sum())

================================================ FILE: debug/ex3.py ================================================

import numpy as np
from numba import cuda

@cuda.jit
def histogram(x, xmin, xmax, histogram_out):
    nbins = histogram_out.shape[0]
    bin_width = (xmax - xmin) / nbins

    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for i in range(start, x.shape[0], stride):
        bin_number = np.int32((x[i] + xmin)/bin_width)
        if bin_number >= 0 or bin_number < histogram_out.shape[0]:
            cuda.atomic.add(histogram_out, bin_number, 1)

x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)
xmin = np.float32(-4.0)
xmax = np.float32(4.0)
histogram_out = np.zeros(shape=10, dtype=np.int32)

histogram[64, 64](x, xmin, xmax, histogram_out)

print('input count:', x.shape[0])
print('histogram:', histogram_out)
print('count:', histogram_out.sum())

================================================ FILE: debug/ex3a.py ================================================

import numpy as np
from numba import cuda

@cuda.jit(debug=True)
def histogram(x, xmin, xmax, histogram_out):
    nbins = histogram_out.shape[0]
    bin_width = (xmax - xmin) / nbins

    start = cuda.grid(1)
    stride = cuda.gridsize(1)

    for i in range(start, x.shape[0], stride):
        bin_number = np.int32((x[i] + xmin)/bin_width)
        if bin_number >= 0 or bin_number < histogram_out.shape[0]:
            cuda.atomic.add(histogram_out, bin_number, 1)

x = np.random.normal(size=50, loc=0, scale=1).astype(np.float32)
xmin = np.float32(-4.0)
xmax = np.float32(4.0)
histogram_out = np.zeros(shape=10, dtype=np.int32)

histogram[64, 64](x, xmin, xmax, histogram_out)

print('input count:', x.shape[0])
print('histogram:', histogram_out)
print('count:', histogram_out.sum())

================================================ FILE: docker/README.md ================================================

# Docker Instructions

To build the images:

```bash
docker build -t conda_cuda_base:latest ./base
docker build -t numba_gtc2017:latest ./notebooks
```

The notebook image takes an optional build argument `BRANCH` to select the branch or commit to check out:

```
docker build -t numba_gtc2017:latest --build-arg BRANCH=master ./notebooks
```

Run the notebook with:

```bash
nvidia-docker run -p 8888:8888 -it numba_gtc2017:latest
```

It will start the jupyter notebook automatically.

================================================ FILE: docker/base/Dockerfile ================================================

# Build with:
#   docker build -t conda_cuda_base:latest .
FROM nvidia/cuda:8.0-devel-ubuntu14.04

RUN apt-get update
RUN apt-get install -y wget git vim

# Add user
RUN useradd -ms /bin/bash appuser
USER appuser
WORKDIR /home/appuser

# Download miniconda
RUN wget https://repo.continuum.io/miniconda/Miniconda3-4.3.11-Linux-x86_64.sh
# Install miniconda
RUN bash Miniconda3-4.3.11-Linux-x86_64.sh -b -p /home/appuser/Miniconda3
# Append PATH to miniconda
ENV PATH=$PATH:/home/appuser/Miniconda3/bin

# Install Jupyter Notebook
RUN conda install -y jupyter notebook
# Install cudatoolkit
RUN conda install -y -c numba cudatoolkit=8
# Install Numba
ARG NUMBA_VERSION=0.33
RUN conda install -y -c numba numba=$NUMBA_VERSION

================================================ FILE: docker/notebooks/Dockerfile ================================================

# Build with:
#   docker build -t numba_gtc2017:latest .
# Run with:
#   nvidia-docker run -p 8888:8888 -it numba_gtc2017:latest
FROM conda_cuda_base:latest

RUN conda install -y scipy matplotlib

RUN git clone https://github.com/ContinuumIO/gtc2017-numba
WORKDIR ./gtc2017-numba
ARG BRANCH="master"
RUN git fetch && git checkout $BRANCH

CMD jupyter notebook --ip=*
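The GPU kernels in notebook 6 can also be sanity-checked against plain NumPy on the CPU. This standalone sketch (illustrative reference code of mine, not part of the repository) reproduces what the `tile_transpose` kernel and the `l2_norm` gufunc should compute:

```python
import numpy as np

# Transpose: tile_transpose should produce exactly a_in.T for a square int32 array
size = 1024
a_in = np.arange(size * size, dtype=np.int32).reshape((size, size))
a_t = a_in.T
assert a_t[0, 1] == 1024  # matches the first row of the notebook's transposed output

# L2 norm along the last axis: what the l2_norm gufunc computes per row
angles = np.random.uniform(-np.pi, np.pi, 10)
coords = np.stack([np.cos(angles), np.sin(angles)], axis=1).astype(np.float32)
norms = np.sqrt((coords.astype(np.float64) ** 2).sum(axis=-1))
assert np.allclose(norms, 1.0)  # points on the unit circle have norm 1
```

Comparing a device result against such a CPU reference (e.g. with `np.testing.assert_array_equal`) is a cheap way to catch indexing bugs in kernels before reaching for `cuda-memcheck`.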