Repository: libmir/dcompute
Branch: master
Commit: c739946b3ad7
Files: 66
Total size: 280.3 KB

Directory structure:
gitextract_f6jo1osq/

├── .gitignore
├── LICENSE.txt
├── README.md
├── docs/
│   ├── 00-prerequsites.md
│   ├── 01-installation.md
│   ├── 02-hardware.md
│   ├── 03-kernels.md
│   ├── 04-std/
│   │   ├── 00-intro.md
│   │   └── 01-index.md
│   ├── 05-driver/
│   │   └── 00-intro.md
│   └── README.md
├── dub.json
└── source/
    └── dcompute/
        ├── driver/
        │   ├── README.md
        │   ├── backend.d
        │   ├── cuda/
        │   │   ├── TODO
        │   │   ├── buffer.d
        │   │   ├── context.d
        │   │   ├── device.d
        │   │   ├── event.d
        │   │   ├── kernel.d
        │   │   ├── memory.d
        │   │   ├── package.d
        │   │   ├── platform.d
        │   │   ├── program.d
        │   │   ├── queue.d
        │   │   └── unified_buffer.d
        │   ├── error.d
        │   ├── ocl/
        │   │   ├── buffer.d
        │   │   ├── context.d
        │   │   ├── device.d
        │   │   ├── event.d
        │   │   ├── image.d
        │   │   ├── kernel.d
        │   │   ├── memory.d
        │   │   ├── package.d
        │   │   ├── platform.d
        │   │   ├── program.d
        │   │   ├── queue.d
        │   │   ├── raw/
        │   │   │   ├── enums.d
        │   │   │   ├── functions.d
        │   │   │   └── package.d
        │   │   ├── sampler.d
        │   │   └── util.d
        │   └── util.d
        ├── kernels/
        │   ├── README.md
        │   └── package.d
        ├── std/
        │   ├── atomic.d
        │   ├── atomic_common.d
        │   ├── cuda/
        │   │   ├── atomic.d
        │   │   ├── index.d
        │   │   └── sync.d
        │   ├── floating.d
        │   ├── index.d
        │   ├── integer.d
        │   ├── memory.d
        │   ├── opencl/
        │   │   ├── image.d
        │   │   ├── index.d
        │   │   ├── math.d
        │   │   └── sync.d
        │   ├── pack.d
        │   ├── package.d
        │   ├── sync.d
        │   └── warp.d
        └── tests/
            ├── dummykernels.d
            ├── main.d
            └── test.d

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# Compiled Object files
*.o
*.obj

# Other outputs from LLVM
*.ll
*.spv
*.spt
*.ptx
*.bc

# Compiled Dynamic libraries
*.so
*.dylib
*.dll

# Compiled Static libraries
*.a
*.lib
__dummy_docs
# Executables
*.exe

# Code coverage
*.lst

# DUB
.dub
docs.json
__dummy.html

.DS_Store


================================================
FILE: LICENSE.txt
================================================
Boost Software License - Version 1.0 - August 17th, 2003

Permission is hereby granted, free of charge, to any person or organization
obtaining a copy of the software and accompanying documentation covered by
this license (the "Software") to use, reproduce, display, distribute,
execute, and transmit the Software, and to prepare derivative works of the
Software, and to permit third-parties to whom the Software is furnished to
do so, all subject to the following:

The copyright notices in the Software and this entire statement, including
the above license grant, this restriction and the following disclaimer,
must be included in all copies of the Software, in whole or in part, and
all derivative works of the Software, unless such copies or derivative
works are solely in the form of machine-executable object code generated by
a source language processor.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. IN NO EVENT
SHALL THE COPYRIGHT HOLDERS OR ANYONE DISTRIBUTING THE SOFTWARE BE LIABLE
FOR ANY DAMAGES OR OTHER LIABILITY, WHETHER IN CONTRACT, TORT OR OTHERWISE,
ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
DEALINGS IN THE SOFTWARE.


================================================
FILE: README.md
================================================
# dcompute

[![Latest version](https://img.shields.io/dub/v/dcompute.svg)](http://code.dlang.org/packages/dcompute)
[![Latest version](https://img.shields.io/github/tag/libmir/dcompute.svg?maxAge=3600)](http://code.dlang.org/packages/dcompute)
[![License](https://img.shields.io/dub/l/dcompute.svg)](http://code.dlang.org/packages/dcompute)
[![Gitter](https://img.shields.io/gitter/room/libmir/public.svg)](https://gitter.im/libmir/public)

## About

This project is a set of libraries designed to work with [LDC][1] to
enable native execution of D on GPUs (and other more exotic OpenCL targets such as FPGAs and DSPs, hereafter just 'GPUs') on the OpenCL and CUDA runtimes. Because DCompute depends on developments in LDC for code generation, a relatively recent LDC is required: use [1.8.0](https://github.com/ldc-developers/ldc/releases/tag/v1.8.0) or newer.

There are four main parts: 
* [std](https://github.com/libmir/dcompute/tree/master/source/dcompute/std): A library containing standard functionality for targeting GPUs and abstractions over the intrinsics of OpenCL and CUDA.
* [driver](https://github.com/libmir/dcompute/tree/master/source/dcompute/driver): Handles all the compute API interactions and provides a friendly, easy-to-use, consistent interface. Of course you can always get down to a lower level of interaction if you need to. You can also use this to execute non-D kernels (e.g. OpenCL or CUDA).
* [kernels](https://github.com/libmir/dcompute/tree/master/source/dcompute/kernels): A set of standard kernels and primitives to cover a large number of use cases and serve as documentation on how (and how not) to use this library.
* [tests](https://github.com/libmir/dcompute/tree/master/source/dcompute/tests): A framework for testing kernels. The suite is runnable with `dub test` (see `dub.json` for the configuration used).

## Examples

> **Note:** The `@kernel()` syntax requires LDC 1.42 or later. If you are using an older version of LDC, please use `@kernel` (without parentheses).

Kernel:
```
@kernel() void saxpy(GlobalPointer!(float) res,
                   float alpha,
                   GlobalPointer!(float) x,
                   GlobalPointer!(float) y, 
                   size_t N)
{
    auto i = GlobalIndex.x;
    if (i >= N) return;
    res[i] = alpha*x[i] + y[i];
}
```

Invoke with (CUDA):
```
q.enqueue!(saxpy)
    ([N,1,1],[1,1,1]) // Grid & block & optional shared memory
    (b_res,alpha,b_x,b_y, N); // kernel arguments
```
This is equivalent to the CUDA code
```
saxpy<<<N,1,0,q>>>(b_res,alpha,b_x,b_y, N);
```

For more examples and the full code see `source/dcompute/tests`.
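
As a cross-check, the kernel's result can be compared against a plain CPU loop. The following is a minimal host-side sketch and is not part of DCompute: `saxpyHost` is a hypothetical helper that evaluates the same `alpha*x[i] + y[i]` expression over ordinary D slices.

```d
/// Hypothetical CPU reference for the saxpy kernel above:
/// res[i] = alpha * x[i] + y[i] for every element.
void saxpyHost(float[] res, float alpha, const(float)[] x, const(float)[] y)
{
    assert(res.length == x.length && x.length == y.length);
    foreach (i; 0 .. res.length)
        res[i] = alpha * x[i] + y[i];
}

void main()
{
    float[] x = [0f, 1f, 2f, 3f];
    float[] y = [10f, 10f, 10f, 10f];
    auto res = new float[](x.length);

    saxpyHost(res, 2.0f, x, y);

    // All values here are exactly representable, so == is safe.
    assert(res == [10f, 12f, 14f, 16f]);
}
```

Once the device buffer has been copied back to the host, comparing it element by element with such a reference is a simple way to validate a launch.
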
## Build Instructions

To build DCompute you will need:
* [ldc][1] as the D compiler.
* a SPIR-V capable LLVM (available [here](https://github.com/thewilsonator/llvm/tree/compute)) to build ldc with SPIR-V support (required for OpenCL),
* or LDC built with any LLVM 3.9.1 or greater that has the NVPTX backend enabled, to support CUDA.
* [dub](https://github.com/dlang/dub), then just run `dub build`.

Alternatively, you can include dcompute as a dependency, as shown below:
  * add
    ```json
	"dependencies": {
		"dcompute": {
			"version": "~>0.1.1"
		}
	},
    ```
    to your `dub.json` under `dependencies`. You should include the following dub flags under `dflags-ldc`, which are passed to the compiler:
	```json
	"dflags-ldc": ["-mdcompute-targets=cuda-800","-mdcompute-targets=ocl-300","-version=LDC_DCompute","-oq"],
	```
	The dflags will be passed to LDC to generate code for the specified targets. You can run `ldc2 --help` to look for that flag. Use `ocl-xy0` for OpenCL x.y and `cuda-xy0` for CUDA Compute Capability x.y. So the above flags are for OpenCL 3.0 and CUDA CC 8.0. The two flags must be included separately as shown above.
    * If you get an error saying `Need to use a DCompute enabled compiler`, you likely forgot the `-mdcompute-targets` flags.
    * Check NVIDIA's [website](https://developer.nvidia.com/cuda-gpus) for your CUDA Compute Capability.
  * Alternatively, if you use `dub.sdl`, add `dependency "dcompute" version="~>0.1.1"` to it and include the equivalent dflags.
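
The `xy0` naming scheme for target values described above can be sketched in a few lines of D. `dcomputeTarget` is a hypothetical helper written for this illustration only; it is not part of DCompute or LDC, and it merely formats the string that follows `-mdcompute-targets=`.

```d
import std.format : format;

/// Hypothetical helper: builds a target value following the `xy0`
/// convention, e.g. ("cuda", 8, 0) gives "cuda-800".
string dcomputeTarget(string api, uint major, uint minor)
{
    assert(api == "ocl" || api == "cuda", "only ocl and cuda are described above");
    assert(minor < 10, "the xy0 scheme leaves a single digit for the minor version");
    return format("%s-%d%d0", api, major, minor);
}

void main()
{
    assert(dcomputeTarget("cuda", 8, 0) == "cuda-800"); // CUDA CC 8.0
    assert(dcomputeTarget("ocl", 3, 0) == "ocl-300");   // OpenCL 3.0
    assert(dcomputeTarget("cuda", 3, 5) == "cuda-350"); // CUDA CC 3.5
}
```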


If you get an error like `Error: unrecognized switch '-mdcompute-targets=cuda-210'`, make sure you are using LDC and not DMD: passing `--compiler=/path/to/ldc2` to dub will force it to use `/path/to/ldc2` as the D compiler.

If you need to build ldc yourself, you will also need a dmd-compatible D compiler ([dmd](https://github.com/dlang/dmd), or ldmd or gdmd, available as part of [ldc][1] and [gdc](https://github.com/D-Programming-GDC/GDC) respectively) and cmake.
 
## Getting Started

Please see the [documentation](https://github.com/libmir/dcompute/blob/master/docs/README.md).

## TODO

Generate OpenCL builtins from [here](https://github.com/KhronosGroup/SPIR-Tools/wiki/SPIR-2.0-built-in-functions)

[1]: https://github.com/ldc-developers/ldc


### Our sponsors

[<img src="https://raw.githubusercontent.com/libmir/mir-algorithm/master/images/symmetry.png" height="80" />](http://symmetryinvestments.com/) 	&nbsp; 	&nbsp;	&nbsp;	&nbsp;
[<img src="https://raw.githubusercontent.com/libmir/mir-algorithm/master/images/kaleidic.jpeg" height="80" />](https://github.com/kaleidicassociates)


================================================
FILE: docs/00-prerequsites.md
================================================
# Prerequisites

In order to use DCompute there are a few things you need before you start:

* Capable hardware

* Drivers for said hardware

* LDC: the LLVM D compiler

## Hardware & Drivers

For NVidia users, any GPU with compute capability 2.1 or higher should work,
although the hardware will dictate the available functionality.
You'll need to install the CUDA development tools.

For everyone else, you will need either a CPU or GPU (or other accelerator)
with an OpenCL 2.1 or higher device implementation.

## LDC

DCompute leverages the LLVM NVPTX (for CUDA) & SPIR-V (for OpenCL)
backends to generate compute kernel code, so your LDC must include the backend for each API you want to target.

To see what targets your version of LDC has, execute `ldc2 -version`.
We aim to support the most recent releases of LDC, but due to the nature of development
some features in DCompute are dependent on features of LDC that may require upgrading your
compiler.

If you wish to be on the bleeding edge, we recommend building LLVM & LDC from source.
Be warned that LLVM has a tendency to break compatibility with LDC, so expect that you may
have to roll back to an older LLVM revision. This goes the other way too: fixing LDC to be compatible
with LLVM trunk will likely break it with a slightly older trunk.



================================================
FILE: docs/01-installation.md
================================================
Installation
============

LDC
---

As mentioned previously, DCompute requires the use of LDC as the D compiler.
All [recent releases of LDC](https://github.com/ldc-developers/ldc/releases)
have the NVPTX backend enabled for targeting NVidia hardware via CUDA.

To verify that your LDC build can target both nvptx and spirv backends, you
can run `ldc2 --version` and look for `nvptx` and `nvptx64` as well as
`spirv32` and `spirv64` under Registered targets.

DCompute
--------

If you are using dub (highly recommended) then all you need to do is add 
`"dcompute": "~>0.1.1"` to your dub.json or 
`dependency "dcompute" version="~>0.1.1"` to your dub.sdl 
dependencies and you should be good to go and can ignore the rest of this section.

If you are not using dub, DCompute has a few dependencies that you need to
include:

* [derelict-cl](https://github.com/DerelictOrg/DerelictCL) for OpenCL bindings
* [bindbc-cuda](https://github.com/badnikhil/bindbc-cuda) for CUDA bindings
* [derelict-util](https://github.com/DerelictOrg/DerelictUtil) shared library loading utilities used by derelict-cl

Configuring bindbc-cuda
-----------------------

Unlike the previous Derelict bindings, `bindbc-cuda` requires you to specify which
CUDA Driver API version to target via a D version flag in your `dub.json`.
This controls which host-side CUDA functions (e.g. `cuMemPrefetchAsync`) are available.

Add the appropriate version to your `dub.json` configuration:

```json
"versions": ["CUDA_120"]
```

Supported version flags: `CUDA_100`, `CUDA_101`, `CUDA_102`, `CUDA_110`, `CUDA_111`,
`CUDA_112`, `CUDA_118`, `CUDA_120`, `CUDA_122`, `CUDA_124`, `CUDA_130`, `CUDA_132`.

If no version flag is specified, `bindbc-cuda` defaults to `CUDA_100` (CUDA 10.0).
Choose the version that matches the CUDA toolkit installed on your system — you can
check yours by running `nvcc --version`.

**Note:** This version flag is independent of the LDC `-mdcompute-targets` flag.
The `dflags` target (e.g. `cuda-210`) controls which GPU hardware architecture
LDC generates PTX code for, while the `versions` flag controls which driver API
functions are available on the host side.

Drivers
-------

To utilise the hardware you need drivers that implement OpenCL 2.1 or higher or CUDA.
Please consult your hardware vendor's website for drivers.

TODO: add a list.


================================================
FILE: docs/02-hardware.md
================================================
Hardware
========

Writing code for DCompute kernels is a bit different from regular CPU programming.

The most noticeable difference is that you write the kernel as the body of a for loop that is then
vectorised and run in parallel by the `device`. As a consequence, there are
no sequencing guarantees, and branching is done as vector mask operations. This
includes `while`-style loops: they continue until every lane of the vector has
completed the loop.
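
The cost of a masked `while` loop can be illustrated with a small CPU-side model. This is only a sketch of the idea described above, not how any device is actually implemented: `lockstepIterations` is a hypothetical function in which every lane advances in lockstep, finished lanes are masked off, and the loop exits only once no lane is active.

```d
/// Hypothetical CPU model of a masked `while` loop.
/// `tripCounts[lane]` is how many iterations that lane needs on its own.
/// Returns the number of lockstep iterations the whole vector pays for.
size_t lockstepIterations(const(uint)[] tripCounts)
{
    auto remaining = tripCounts.dup;   // per-lane work left
    size_t iterations = 0;
    for (;;)
    {
        bool anyActive = false;
        foreach (ref r; remaining)
        {
            if (r > 0)                 // lane is still in the mask
            {
                --r;                   // masked-off lanes simply idle
                anyActive = true;
            }
        }
        if (!anyActive)
            break;                     // every lane has left the loop
        ++iterations;
    }
    return iterations;
}

void main()
{
    // Lanes needing 1, 4 and 2 iterations: the vector as a whole takes 4.
    assert(lockstepIterations([1u, 4u, 2u]) == 4);
}
```

In this model the total is the maximum of the per-lane trip counts, which is why strongly divergent loops waste lanes.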

Virtual functions and function pointers are infeasible and therefore not supported;
this includes classes and delegates. Alias template parameters still work.

Due to the large number of concurrent threads, it is very easy to end up with a
data race, not helped by the fact that any synchronisation (fences, atomics) must
be done manually. Fences and atomics can be quite expensive.

CPUs
----
Caches are present and reasonable in size. Vectors are relatively short. Branch
prediction is good.

GPUs
----

Caches may be present but are much smaller relative to the number of threads.
Vectors are generally wider than on CPUs. Branch prediction is absent. The top-level
dcache is small, so you really don't want to spill your stack. Texture fetch means
you can load from nearby in 2D or 3D efficiently.

FPGAs
-----

Instructions are in hardware, each and every one of them counts: shrinking your
instruction count can increase your vector width as vector width is determined
by the available datapaths. Execution speed is determined by dataflow.
Timing is very important. You tell a CPU what to do, while you tell an FPGA what to be.



================================================
FILE: docs/03-kernels.md
================================================
Kernels
=======

At the heart of DCompute are the special attributes `@compute` and `@kernel()` from the module `ldc.dcompute`.

> **Note:** The `@kernel()` syntax requires LDC 1.42 or later. If you are using an older version of LDC, please use `@kernel` (without parentheses).

`@compute` tells the D compiler that this module should be built to target the device.
`@compute` takes a single parameter that indicates whether to target only the device
(`@compute(CompileFor.deviceOnly)`) or to target the host as well (`@compute(CompileFor.hostAndDevice)`).

`@kernel()` specifies that the attached function should be an entry point for the device,
i.e. you can tell the driver to execute this function on the device, 
whereas you can't for functions that aren't marked `@kernel()`.

Address Spaces 
--------------

Also critical in using DCompute is the notion of address spaced pointers.
These are available from the module `ldc.dcompute` in the form of the magic template
`Pointer!(uint addrspace,T)` which is a pointer to a `T` that resides in the address space `addrspace`. 
There are 5 address spaces: Global, Shared, Constant, Private and Generic.

Global is available to all tasks on the device. It is the only address space that the host can both read and write. 

Shared is memory that is local to a group of threads/work items. 
Threads (or work items in OpenCL speak) are the unit of execution.

Constant memory is writeable by the host but read-only for the device.
It is somewhat like read-only pages, but it has some special caching properties.

Private memory is local to a thread and contains its registers and stack. 

Generic is not really an address space but a Generic pointer can point anywhere in 
the other address spaces and is useful if you are writing library routines that 
don't know ahead of time where the pointer will point to. You could of course just template the address space.

For more information on this concept just search for documentation on OpenCL and/or CUDA.

The table below shows the equivalent terms in DCompute, OpenCL and CUDA.

| DCompute | OpenCL       | CUDA           |
|----------|--------------|----------------|
| Global   | `__global`   | `__device__`   |
| Shared   | `__local`    | `__shared__`   |
| Constant | `__constant` | `__constant__` |
| Private  | `__private`  | `__local__`    |
| Generic  | `__generic`  | (no qualifier) |


Hello World
-----------

About the simplest kernel you can have is shown below (note that `@kernel()` functions MUST return `void` or you'll get errors):

```d
@compute(CompileFor.deviceOnly) module mykernels;
import ldc.dcompute;
@kernel() void mykernel(GlobalPointer!float a,GlobalPointer!float b, float c)
{
    *a = *b + c;
}
```

It's not a very useful kernel because it only assigns to the first element of `a`.

Compile with `ldc2 -mdcompute-targets=ocl-210,cuda-350 -oq` to target OpenCL 2.1 and CUDA SM 3.5.

Non D kernels
-------------

While a major part of DCompute is being able to write kernels in D, there is nothing stopping
you from using it as a nicer wrapper for kernels written in e.g. OpenCL C or CUDA.
All that you need to ensure is that the (mangled) name and signature of the kernel's D declaration match
its definition in the other language, and you can use it as if it were a D kernel.

For OpenCL this means declaring the kernels `extern(C)`; for CUDA, `extern(C++)` unless the kernel is declared
`extern "C"`, in which case use `extern(C)`. You will also need to alter the build process to compile and link
the foreign kernel.

E.g.
OpenCL:
```opencl
__kernel void foo() {}
```

CUDA:
```cuda
extern "C" __global__ void foo() {}
```

D:
```d
@compute(CompileFor.deviceOnly)
module bar;

extern(C) @kernel() void foo();
```



================================================
FILE: docs/04-std/00-intro.md
================================================
The device standard library
============================

Much like the regular standard library the device standard library 
(`dcompute.std.*`) provides implementations of common functions,
usually implemented as compiler intrinsics.


================================================
FILE: docs/04-std/01-index.md
================================================
Index
=====

To do anything useful with DCompute, a thread needs to know its index, i.e. its position.
If you take a look at `dcompute.std.index` you'll see there are quite a few to choose from.
Most of the indices are three dimensional and represent offsets in a "3D" view of memory.
Of course not all problems are 3D so the y and z values are not always useful.

If you come from OpenCL or CUDA the table below should help you familiarise yourself with the different indices available.

Index Terminology:

| DCompute           | CUDA                          | OpenCL                   |
|--------------------|-------------------------------|--------------------------|
| GlobalDimension    | `gridDim*blockDim`            | `get_global_size()`      |
| GlobalIndex        | `blockDim*blockIdx+threadIdx` | `get_global_id()`        |
|                    |                               |                          |
| GroupDimension     | `gridDim`                     | `get_num_groups()`       |
| GroupIndex         | `blockIdx`                    | `get_group_id()`         |
|                    |                               |                          |
| SharedDimension    | `blockDim`                    | `get_local_size()`       |
| SharedIndex        | `threadIdx`                   | `get_local_id()`         |
|                    |                               |                          |
| GlobalIndex.linear | A nasty calculation           | `get_global_linear_id()` |
| SharedIndex.linear | Ditto                         | `get_local_linear_id()`  |

Note:
\*Index.{x,y,z} are bounded by \*Dimension.{x,y,z}.

Use `SharedIndex` values to index Shared memory and `GlobalIndex` values to index Global memory.

A Group is the ratio of Global to Shared. GroupDimension is NOT the size of a single
group (that's SharedDimension); rather, it is the number of groups along a given dimension.
Similarly, GroupIndex is the position of the current group along a given dimension, counted in units of SharedDimension.
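
The relationships in the table can be checked on the CPU with a few lines of ordinary D. This is a sketch only: `Dim3`, `globalIndex`, `globalDimension` and `linear` are hypothetical names for this illustration, not DCompute's API, and the linearisation assumes that `x` varies fastest, as in OpenCL's `get_global_linear_id()` with a zero global offset.

```d
/// Hypothetical 3-component index/extent type for this sketch.
struct Dim3 { size_t x, y, z; }

/// GlobalIndex = GroupIndex * SharedDimension + SharedIndex, per component
/// (CUDA: blockIdx * blockDim + threadIdx).
Dim3 globalIndex(Dim3 groupIndex, Dim3 sharedDim, Dim3 sharedIndex)
{
    return Dim3(groupIndex.x * sharedDim.x + sharedIndex.x,
                groupIndex.y * sharedDim.y + sharedIndex.y,
                groupIndex.z * sharedDim.z + sharedIndex.z);
}

/// GlobalDimension = GroupDimension * SharedDimension, per component.
Dim3 globalDimension(Dim3 groupDim, Dim3 sharedDim)
{
    return Dim3(groupDim.x * sharedDim.x,
                groupDim.y * sharedDim.y,
                groupDim.z * sharedDim.z);
}

/// Flattens a 3D index within `dim` into one number, x varying fastest.
size_t linear(Dim3 idx, Dim3 dim)
{
    return (idx.z * dim.y + idx.y) * dim.x + idx.x;
}

void main()
{
    // Thread 5 of group 2, with 64 threads per group, is global thread 133.
    assert(globalIndex(Dim3(2, 0, 0), Dim3(64, 1, 1), Dim3(5, 0, 0)) == Dim3(133, 0, 0));
    // 4 groups of 64 threads cover 256 global positions along x.
    assert(globalDimension(Dim3(4, 1, 1), Dim3(64, 1, 1)) == Dim3(256, 1, 1));
    // (3*5 + 2)*4 + 1 = 69
    assert(linear(Dim3(1, 2, 3), Dim3(4, 5, 6)) == 69);
}
```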

Extending the previous example to add a constant to each element of an array and assign the result to another
(we could also have used `GlobalIndex.linear`), we have:

```d
@compute(CompileFor.deviceOnly) module mykernels;
import ldc.attributes;
import ldc.dcompute;
import dcompute.std.index;
alias gf = GlobalPointer!float;
@kernel() void mykernel(gf a, gf b, float c)
{
    auto i = GlobalIndex.x;
    a[i] = b[i] + c;
}
```  

With the same command line as before.

Autoindex
---------

`AutoIndexed` is a type that automatically indexes a `GlobalPointer` or `SharedPointer`
to make kernel lambdas nicer to use.


================================================
FILE: docs/05-driver/00-intro.md
================================================
Driver
======

Now that you've successfully written your kernel, how do you execute it?
That's the job of the driver.

The driver (`dcompute.driver`) manages the interactions with the compute APIs
(OpenCL and CUDA). This doesn't stop you interacting with them directly; it
just provides you with a consistent and (as much as is possible) boilerplate-free
interface.

API objects
-----------

There are a number of driver API objects that wrap the underlying compute API 
objects. They are summarised briefly below. More in depth information is available
in the corresponding subsection of this chapter.

**Platform:** Represents one implementation of a compute API. You can query it for the
devices that are available through it.

**Device:** Represents a unit of execution (e.g. a GPU). Group devices together to form a
context. You can query a large number of properties about performance characteristics
and available memory.

**Context:** A key API object. You create queues, buffers/images, samplers and programs from it.

**Memory:** Represents a region of memory. An abstract base class of buffers & images.

**Buffers:** Represents a 1, 2 or 3D (possibly strided) linear view of memory.

**Images:** Represents a 1, 2 or 3D view of memory whose layout is determined by the format of the
image (number and datatype of the channels).

**Programs:** Represents a hunk of code for a context. You can create Kernels from a linked 
program (i.e. all external dependencies resolved).

**Queue:** Represents a list of work (data transfers & kernel launches) and the graph of their
dependencies.

**Kernel:** Represents a callable function from a Program and associated function parameters.
Submit kernels with supplied parameters to a queue to execute them on the queue's 
context's devices.

**Event:** Represents a future return value from executing an asynchronous operation, such 
as a data transfer or kernel launch.

# Running a Kernel

Now, let's run our `mykernel` kernel that we have built up (see `04-std/01-index.md`). Recall
that our kernel code should be in a separate file. For our main function, we can have something
as shown below. This assumes compilation for the CUDA backend. Note that we import our
`mykernels` module containing our kernel code and the dcompute driver for CUDA.

```d
import std.stdio;
import ldc.dcompute;
import std.algorithm;
import std.file;
import std.traits;
import std.meta;
import std.exception : enforce;
import std.experimental.allocator;
import std.array;
import mykernels;
import dcompute.driver.cuda;

int main()
{
    enum size_t N = 128;
    float c = 5.0;
    float[N] res, x;
    foreach (i; 0 .. N)
    {
        x[i] = i;
    }

    Platform.initialise();

    auto devs = Platform.getDevices(theAllocator);
    auto dev   = devs[0];
    auto ctx   = Context(dev); scope(exit) ctx.detach();

    // Change the file to match your GPU.
    Program.globalProgram = Program.fromFile("kernels_cuda800_64.ptx");
    auto q = Queue(false);

    Buffer!(float) b_res, b_x;
    b_res =  Buffer!(float)(res[]); scope(exit) b_res.release();
    b_x   =  Buffer!(float)(x[]);   scope(exit) b_x.release();

    b_x.copy!(Copy.hostToDevice);

    q.enqueue!(mykernel)
              ([N,1,1],[1,1,1])
              (b_res,b_x,c);
    b_res.copy!(Copy.deviceToHost);

    foreach(i; 0 .. N)
        enforce(res[i] == x[i] + c);
    writeln(res[]);

    return 0;
}
```
It is important to change the file path on the `Program.fromFile("kernels_cuda800_64.ptx")` line
to the ptx file generated by the compilation step. Depending on how you set up dub, it may be in
`./.dub/obj` or just your project directory. You should verify that your kernels actually show
up in the ptx file after running dub build (it's in plaintext).

With the above example, we should get a successful run with the integers from 5 to 132 printed, since
our kernel adds c, which is 5 in this case, to the input vector, which has 0 to 127 in our case.
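
The launch above passes `[N,1,1]` and `[1,1,1]`, which, going by the README's "Grid & block" comment, means `N` groups of a single thread each. A common alternative is to use larger groups and round the group count up. The following host-side sketch shows that arithmetic; `groupsFor` is a hypothetical helper, not part of the driver.

```d
/// Hypothetical helper: smallest number of groups such that
/// groups * groupSize >= n (ceiling division).
uint groupsFor(size_t n, uint groupSize)
{
    assert(groupSize > 0);
    return cast(uint)((n + groupSize - 1) / groupSize);
}

void main()
{
    assert(groupsFor(128, 32) == 4); // divides evenly
    assert(groupsFor(130, 32) == 5); // last group is only partly used
    assert(groupsFor(0, 32) == 0);
}
```

When `n` is not a multiple of the group size, the last group contains threads whose index is past the end of the data, so the kernel needs a guard such as the `if (i >= N) return;` in the README's `saxpy`. `mykernel` as written has no such guard.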

See `source/dcompute/tests` for examples of a slightly more complicated kernel and of running with the OpenCL driver.


================================================
FILE: docs/README.md
================================================
## Welcome to the DCompute documentation!

DCompute is a library that, together with LDC, lets D code be compiled for and run on GPUs.

DCompute is split into three sections: a driver, a standard library and a set of prewritten kernels.

The driver is intended to abstract the (rather unwieldy) compute APIs of CUDA and OpenCL.
But you can still pull all the levers yourself if you feel the need.

The standard library contains the set of primitive operations exposed by the compute environment as well as other common operations.

These docs are designed to help you get started installing & using DCompute.

## Table of Contents

0. Prerequisites to using DCompute
1. Installing DCompute
2. Understanding the hardware that DCompute runs on
3. Writing kernels
4. The device standard library
4.1 index
5. The compute API driver

You can find the corresponding Readme for each of the listed items in the parent `docs` directory, labelled with names
starting with 00 through 05. For the device standard library and compute API driver, look in the 
subdirectories `04-std` and `05-driver`, respectively. These instructions will help you install and execute
your first kernel with DCompute.

## D Basics Refresher

This guide assumes that the reader is familiar with the basics of D, although anyone 
familiar with the C family of languages should be able to understand most of it.

Some of the main differences are listed below:

The template instantiation operator is binary `!` in contrast to paired angle brackets
as in C++/C# et al. If `Foo` is a templated struct that takes one type parameter then
`Foo!int foo;` declares a variable `foo` whose type is `Foo` instantiated with `int`.
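
A minimal illustration of the `!` operator, using a made-up `Foo` that matches the description above (the `value` field is an assumption added for the example):

```d
/// Hypothetical one-parameter templated struct.
struct Foo(T)
{
    T value;
}

void main()
{
    Foo!int foo;             // single-token argument: parentheses are optional
    Foo!(const(char)[]) bar; // longer arguments need Foo!( ... )

    static assert(is(typeof(foo.value) == int));
    static assert(is(typeof(bar.value) == const(char)[]));

    foo.value = 42;
    assert(foo.value == 42);
}
```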

There is a third class of template parameters: aliases (the other two being types and values).
`alias` template parameters can, in addition to holding types and values, hold _symbols_.
These include variables, functions and lambdas. `alias` when used outside of a template parameter
list is the equivalent of `using` in C++.
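
Both uses of `alias` can be seen in a short sketch. The names `applyTwice`, `addOne` and `Index` are invented for this example.

```d
/// `fun` is an alias parameter: it binds to a symbol such as a
/// function or a lambda, which is then called directly.
auto applyTwice(alias fun, T)(T x)
{
    return fun(fun(x));
}

int addOne(int x) { return x + 1; }

// Outside a template parameter list, `alias` names a type,
// much like `using Index = std::size_t;` in C++.
alias Index = size_t;

void main()
{
    assert(applyTwice!addOne(1) == 3);        // a named function
    assert(applyTwice!(x => x * 2)(3) == 12); // a lambda
    Index i = 7;
    assert(i == 7);
}
```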

`~` is the concatenation operator, used unsurprisingly to concatenate arrays. 
Used widely in string manipulation.
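
For example, on arrays and strings (the related `~=` operator appends in place):

```d
void main()
{
    int[] a = [1, 2];
    int[] b = a ~ [3, 4];        // concatenation yields a new array
    assert(b == [1, 2, 3, 4]);
    assert(a == [1, 2]);         // the operands are unchanged

    string s = "d" ~ "compute";  // strings are arrays of characters
    assert(s == "dcompute");

    b ~= 5;                      // append to an existing array
    assert(b == [1, 2, 3, 4, 5]);
}
```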

Uniform Function Call Syntax (UFCS) allows you to call a free function as if it were a
method of the type of its first argument (e.g. `f(x,y)` can be called as `x.f(y)`).
This together with optional parentheses (`x.f()`, where `f` is a method or UFCS function of `x`,
may be written as `x.f`) allows you to write chains of calls `h(g(f(x)))` as `x.f.g.h`.
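
A small sketch with made-up free functions `twice`, `inc` and `add`:

```d
int twice(int x) { return x * 2; }
int inc(int x) { return x + 1; }
int add(int x, int y) { return x + y; }

void main()
{
    int x = 5;

    // f(x, y) written as x.f(y)
    assert(add(x, 2) == x.add(2));

    // inc(twice(x)) written as a left-to-right chain without parentheses
    assert(inc(twice(x)) == x.twice.inc);
    assert(x.twice.inc == 11);
}
```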

`class`es are polymorphic reference types. `struct`s are value types. Idiomatic D code
tends to use structs over classes. Classes are not used at all in DCompute.

The `.` operator will implicitly follow any pointers, although it will not dereference the last
one in a chain of `.`s. There is no operator `->` or `::`; these are both handled by `.`.
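
For instance, with a pointer to a struct (the `Point` type is invented for this example):

```d
struct Point { int x, y; }

void main()
{
    Point p = Point(1, 2);
    Point* pp = &p;

    pp.x = 10;            // C and C++ would write pp->x
    assert(p.x == 10);
    assert((*pp).y == 2); // explicit dereference still works
}
```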

`static if` is D's conditional compilation construct. Code inside a taken branch is compiled
into the object file; code inside a branch that is _not_ taken must be syntactically correct, but
need not be semantically correct.
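
The sketch below uses two invented functions, `describe` and `firstOrSelf`, to show both points: the branch is chosen at compile time, and the branch that is not taken may contain code that would not type-check for that instantiation.

```d
/// Chooses a description at compile time from the size of T.
string describe(T)()
{
    static if (T.sizeof == 4)
        return "32-bit";
    else static if (T.sizeof == 8)
        return "64-bit";
    else
        return "other";
}

/// For arrays returns the first element, otherwise the value itself.
auto firstOrSelf(T)(T v)
{
    static if (is(T : E[], E))
        return v[0]; // would not compile for T == int, but is never compiled then
    else
        return v;
}

static assert(describe!float() == "32-bit");
static assert(describe!double() == "64-bit");

void main()
{
    assert(firstOrSelf(3) == 3);
    assert(firstOrSelf([7, 8]) == 7);
}
```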

For more information see the [D Wiki](https://wiki.dlang.org/Coming_From).


================================================
FILE: dub.json
================================================
{
    "name": "dcompute",
    "description": "Native Heterogeneous Computing for D",
    "copyright": "Copyright © 2017, Nicholas Wilson",
    "authors": ["Nicholas Wilson"],
    "license": "BSL-1.0",
    "dependencies": {
        "derelict-cl"  : "~>3.2.0",
        "bindbc-cuda": "~>0.1.0",
        "taggedalgebraic": "~>0.10.7"
    },
    "configurations": [
        {
            "name": "library",
            "targetType": "library",
            "excludedSourceFiles": ["source/dcompute/tests/*"],
        },
        {
            "name": "unittest",
            "dflags" : ["-mdcompute-targets=cuda-210" ,"-oq"],
            "targetType": "executable",
            "versions": ["DComputeTesting"],
        },
        {
            "name": "test-cuda",
            "dflags" : ["--mdcompute-targets=cuda-210", "-oq"],
            "targetType": "executable",
            "versions": ["DComputeTestCUDA"],
        },
        {
            "name": "test-ocl",
            "dflags" : ["--mdcompute-targets=ocl-200", "-oq"],
            "targetType": "executable",
            "versions": ["DComputeTestOpenCL"],
        },
   ]
}


================================================
FILE: source/dcompute/driver/README.md
================================================
dcompute.driver
===============

Contains the abstracted driver interface for dcompute. It contains a delegation
layer to the OpenCL (`dcompute.driver.ocl`) and CUDA (`dcompute.driver.cuda`)
drivers, and code to load the appropriate system drivers and "get up and running" in a
platform-agnostic manner. Unless you're doing something that absolutely needs
specific driver functionality, you should use this API rather than the
individual compute APIs.

The API objects and their equivalents in OpenCL and CUDA are listed in the table
below.

| Dcompute | CUDA        | OpenCL           |
| -------- | ----        | ------           |
| Platform | N/A\*       | cl_platform_id   |
| Device   | CUdevice    | cl_device_id     |
| Context  | CUcontext   | cl_context       |
| Queue    | CUstream    | cl_command_queue |
| Memory   | CUdeviceptr | cl_mem \*\*      |
| Module   | CUmodule    | cl_program       |
| Kernel   | CUfunction  | cl_kernel        |
| Event    | cudaEvent_t | cl_event         |

In addition there are a few Allocator types that allocate device local and shared 
virtual memory.

\* We make CUDA a platform of its own.

\*\* includes buffers, images and pipes.


Platform
--------

Performs the loading of the driver and handles any global initialisation required (e.g. `cuInit(0)`). Gives access to its devices.


Device
------

Devices can be queried for information. You create `Context`s from them.


Context
-------

Create `Queue`s, `Memory` objects, `Module`s from these. By default this stores a
single `Module` along with it, created for all devices that are part of this 
context.


Queue
-----

Submit `Kernel` invocations and `Memory` transfers/maps (returning `Event`s),
 set `Device` affinity.


Kernel
------

Extracted from `Module`s. You can choose not to use these directly if you wish and
let this library do all the API bashing for you straight from the module.
However, you can extract these from `Module`s if you wish to avoid the re-extraction
costs.


Event
-----

Returned as a result of enqueuing something. You can set callbacks on these, or wait
on them. Useful for synchronisation.
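
As a rough illustration of how the objects described above fit together, here is a minimal sketch written against the CUDA driver (`dcompute.driver.cuda`) rather than the abstracted layer. It is not a definitive recipe. The kernel `saxpy`, the module `mykernels` and the PTX file name are assumptions made for the example, and running it needs LDC with dcompute enabled, the CUDA driver library and an NVIDIA GPU. Note that the CUDA driver calls the `Module` object `Program`.

```d
import dcompute.driver.cuda;
import std.experimental.allocator : theAllocator;

// Hypothetical @compute module, assumed to provide:
//   @kernel void saxpy(GlobalPointer!float res, float alpha,
//                      GlobalPointer!float x, GlobalPointer!float y, size_t n);
import mykernels : saxpy;

void main()
{
    enum uint n = 128;
    float alpha = 2.0f;
    float[n] res, x, y;
    res[] = 0.0f;
    x[]   = 1.0f;
    y[]   = 3.0f;

    // Platform: load the driver and perform global initialisation.
    Platform.initialise();

    // Device -> Context.
    auto devs = Platform.getDevices(theAllocator);
    auto ctx  = Context(devs[0]);
    scope(exit) ctx.detach();

    // Module (called Program here). The file name is an assumption; use
    // whatever PTX file your build actually emits.
    Program.globalProgram = Program.fromFile("kernels_cuda210_64.ptx");

    // Queue: a CUDA stream.
    auto q = Queue(false);

    // Memory: device buffers paired with host arrays.
    auto bRes = Buffer!float(res[]); scope(exit) bRes.release();
    auto bX   = Buffer!float(x[]);   scope(exit) bX.release();
    auto bY   = Buffer!float(y[]);   scope(exit) bY.release();

    bX.copy!(Copy.hostToDevice);
    bY.copy!(Copy.hostToDevice);

    // Kernel: looked up by name in Program.globalProgram and launched
    // with a grid of n blocks, one thread per block.
    q.enqueue!(saxpy)([n, 1, 1], [1, 1, 1])(bRes, alpha, bX, bY, n);

    bRes.copy!(Copy.deviceToHost);
}
```

The launch goes through `Queue.enqueue`, which extracts the kernel from the module on every call. That is the "let this library do the API bashing" path described under Kernel.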



================================================
FILE: source/dcompute/driver/backend.d
================================================
module dcompute.driver.backend;

enum Backend
{
    OpenCL120,
    CUDA650,
}


================================================
FILE: source/dcompute/driver/cuda/TODO
================================================
cuLink.*
cuIpc.*
cuTexRef.*
cuTexObj.*
cuSurfRef.*
cuSurfObj.*


================================================
FILE: source/dcompute/driver/cuda/buffer.d
================================================
module dcompute.driver.cuda.buffer;

import dcompute.driver.cuda;

struct Buffer(T)
{
    size_t raw;

    // Host memory associated with this buffer
    T[] hostMemory;

    this(size_t elems)
    {
        status = cast(Status)cuMemAlloc(&raw,elems * T.sizeof);
        checkErrors();
        hostMemory = null;
    }

    this(T[] arr)
    {
        status = cast(Status)cuMemAlloc(&raw,arr.length * T.sizeof);
        checkErrors();
        hostMemory = arr;
    }
    void copy(Copy c)()
    {
        static if (c == Copy.hostToDevice)
        {
            status = cast(Status)cuMemcpyHtoD(raw, hostMemory.ptr,hostMemory.length * T.sizeof);
        }
        else static if  (c == Copy.deviceToHost)
        {
            status = cast(Status)cuMemcpyDtoH(hostMemory.ptr,raw,hostMemory.length * T.sizeof);
        }
        checkErrors();
    }
    alias hostArgOf(U : GlobalPointer!T) = raw; 
    void release()
    {
        status = cast(Status)cuMemFree(raw);
        checkErrors();
        raw = 0;
        hostMemory = null;
    }
}

alias bf = Buffer!float;


================================================
FILE: source/dcompute/driver/cuda/context.d
================================================
module dcompute.driver.cuda.context;

import dcompute.driver.cuda;

struct Context
{
    CUcontext raw;
    this(Device dev, uint flags = 0)
    {
        status = cast(Status)cuCtxCreate(&raw, flags,dev.raw);
        checkErrors();
    }
    
    static void push(Context ctx)
    {
        status = cast(Status)cuCtxPushCurrent(ctx.raw);
        checkErrors();
    }
    
    static Context pop()
    {
        Context ret;
        status = cast(Status)cuCtxPopCurrent(&ret.raw);
        checkErrors();
        return ret;
    }
    static @property Context current()
    {
        Context ret;
        status = cast(Status)cuCtxGetCurrent(&ret.raw);
        checkErrors();
        return ret;
    }
    
    static @property void current(Context ctx)
    {
        status = cast(Status)cuCtxSetCurrent(ctx.raw);
        checkErrors();
    }
    
    static void sync()
    {
        status = cast(Status)cuCtxSynchronize();
        checkErrors();
    }
    //CUlimit
    enum Limit
    {
        stackSize,
        printfFifoSize,
        mallocHeapSize,
        deviceRuntimeSyncDepth,
        deviceRuntimePendingLaunchCount
    }
    
    static @property void limit(Limit what)(size_t lim)
    {
        status = cast(Status)cuCtxSetLimit(what,lim);
        checkErrors();
    }
    
    static @property size_t limit(Limit what)()
    {
        size_t ret;
        status = cast(Status)cuCtxGetLimit(&ret,what);
        checkErrors();
        return ret;
    }
    //CUfunc_cache
    enum CacheConfig
    {
        preferNone,
        preferShared,
        preferL1,
        preferEqual,
    }
    
    // CacheConfig mirrors CUfunc_cache, so use the cache-config calls
    // (the shared-memory bank config is a separate setting, CUsharedconfig).
    static @property void cacheConfig(CacheConfig cc)
    {
        status = cast(Status)cuCtxSetCacheConfig(cc);
        checkErrors();
    }
    
    
    static @property CacheConfig cacheConfig()
    {
        CacheConfig ret;
        status = cast(Status)cuCtxGetCacheConfig(cast(int*)&ret);
        checkErrors();
        return ret;
    }
    
    @property uint apiVersion()
    {
        uint ret;
        status = cast(Status)cuCtxGetApiVersion(raw,&ret);
        checkErrors();
        return ret;
    }
    
    static void getQueuePriorityRange(out int lo, out int hi)
    {
        status = cast(Status)cuCtxGetStreamPriorityRange(&lo,&hi);
        checkErrors();
    }
    
    void detach()
    {
        status = cast(Status)cuCtxDetach(raw);
        checkErrors();
    }
}


================================================
FILE: source/dcompute/driver/cuda/device.d
================================================
module dcompute.driver.cuda.device;

import dcompute.driver.cuda;

struct Device
{
    int raw;
    //struct CUdevprop
    static struct Info
    {
        @(1)  int maxThreadsPerBlock;
        @(2)  int maxThreadsDimX;
        @(3)  int maxThreadsDimY;
        @(4)  int maxThreadsDimZ;
        @(5)  int maxGridSizeX;
        @(6)  int maxGridSizeY;
        @(7)  int maxGridSizeZ;
        @(8)  int sharedMemPerBlock;
        @(9)  int totalConstantMemory;
        @(10) int SIMDWidth; // warp size
        @(11) int maxPitch;
        @(12) int regsPerBlock;
        @(13) int clockRate;
        @(14) int textureAlign;
        @(15) int GPUOverlap;
        @(16) int multiprocessorCount;
        @(17) int kernelExecTimeout;
        @(18) int integrated;
        @(19) int canMapHostMemeory;
        @(20) int computeMode;
        @(21) int maxTexture1DWidth;
        @(22) int maxTexture2DWidth;
        @(23) int maxTexture2DHeight;
        @(24) int maxTexture3DWidth;
        @(25) int maxTexture3DHeight;
        @(26) int maxTexture3DDepth;
        @(27) int maxTexture2DLayeredWidth;
        @(28) int maxTexture2DLayeredHeight;
        @(29) int maxTexture2DLayeredLayers;
        @(27) int maxTexture2DArrayWidth;
        @(28) int maxTexture2DArrayHeight;
        @(29) int maxTexture2DArrayNumSlices;
        @(30) int surfaceAlignment;
        @(31) int concurrentKernels;
        @(32) int eccEnabled;
        @(33) int PCIBusID;
        @(34) int PCIDeviceID;
        @(35) int tccDriver;
        @(36) int memoryClockRate;
        @(37) int globalMemoryBusWidth;
        @(38) int L2CacheSize;
        @(39) int maxThreadPerMultiProcessor;
        @(40) int asyncEngineCount;
        @(41) int unifiedAddressing;
        @(42) int maxTexture1DLayeredWidth;
        @(43) int maxTexture1DLayeredLayers;
        @(44) int canTex2DGather;
        @(45) int maxTextrue2DGatherWidth;
        @(46) int maxTextrue2DGatherHeight;
        @(47) int maxTexture3DWidthAlternative;
        @(48) int maxTexture3DHeightAlternative;
        @(49) int maxTexture3DDepthAlternative;
        @(50) int PICDomainID;
        @(51) int texturePitchAlignment;
        @(52) int textureCubemapWidth;
        @(53) int textureCubemapLayeredWidths;
        @(54) int textureCubemapLayeredLayers;
        @(55) int maxSurface1DWidth;
        @(56) int maxSurface2DWidth;
        @(57) int maxSurface2DHeight;
        @(58) int maxSurface3DWidth;
        @(59) int maxSurface3DHeight;
        @(60) int maxSurface3DDepth;
        @(61) int maxSurface1DLayeredWidth;
        @(62) int maxSurface1DLayeredLayers;
        @(63) int maxSurface2DLayeredWidth;
        @(64) int maxSurface2DLayeredHeight;
        @(65) int maxSurface2DLayeredLayers;
        @(66) int maxSurfaceCubemapWidth;
        @(67) int maxSurfaceCubemapLayeredWidth;
        @(68) int maxSurfaceCubemapLayeredLayers;
        @(69) int maxTaxture1DLinearWidth;
        @(70) int maxTaxture2DLinearWidth;
        @(71) int maxTaxture2DLinearHeight;
        @(72) int maxTaxture2DLinearPitch;
        @(73) int maxTaxture2DMipmappedWidth;
        @(74) int maxTaxture2DMipmappedHeight;
        @(75) int computeCapabilityMajor;
        @(76) int computeCapabilityMinor;
        @(77) int maxTaxture1DMipmappedWidth;
        @(78) int streamPrioritiesSupported;
        @(79) int globalL1CacheSupported;
        @(80) int localL1CacheSupported;
        @(81) int maxSharedMemoryPerMultiprocessor;
        @(82) int maxRegistorsPerMultiprocessor;
        @(83) int managedMemory;
        @(84) int multiGPUBoard;
        @(85) int multiGPUBoardGroupID;
    }
    
    @property size_t totalMemory()
    {
        size_t ret;
        status = cast(Status)cuDeviceTotalMem(&ret,raw);
        checkErrors();
        return ret;
    }
    
    //char[] name : cuDeviceGetName

    static foreach (mem; __traits(allMembers, Info)) {
        mixin(
            ` @property int `,
            mem,
            ` () { int result; `,
            ` status = cast(Status)cuDeviceGetAttribute( `,
            ` &result, `,
            ` cast(CUdevice_attribute) `,
             __traits(getAttributes, __traits(getMember, Info, mem))[0].stringof,
            `, raw); `,
            ` checkErrors(); `,
            ` return result; `,
            ` } `);
    }

    // Unified Memory capability helpers

    /**
     * Returns true when the device supports CUDA Managed Memory
     * (cuMemAllocManaged / UnifiedBuffer).
     * Requires Compute Capability >= 3.0.
     * Wraps CU_DEVICE_ATTRIBUTE_MANAGED_MEMORY (attribute 83).
     */
    @property bool supportsUnifiedMemory()
    {
        return managedMemory != 0;
    }

    /**
     * Returns true when the device participates in Unified Virtual
     * Addressing (UVA) i.e. the same virtual address is valid on
     * both host and device.  True on all 64-bit CUDA systems with
     * CC >= 2.0.
     * Wraps CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING (attribute 41).
     */
    @property bool supportsUnifiedAddressing()
    {
        return unifiedAddressing != 0;
    }
}


================================================
FILE: source/dcompute/driver/cuda/event.d
================================================
module dcompute.driver.cuda.event;

import dcompute.driver.cuda;

struct Event
{
    CUevent raw;
    
}


================================================
FILE: source/dcompute/driver/cuda/kernel.d
================================================
module dcompute.driver.cuda.kernel;

import dcompute.driver.cuda;
struct Kernel(F) if (is(F==function)|| is(F==void))
{
    CUfunction raw;
    
    static struct Attributes
    {
        @(0) int maxThreadsPerBlock;
        // in Bytes
        @(1) int sharedSize;
        @(2) int constSize;
        @(3) int localSize;
        
        @(4) int numRegs;
        @(5) int ptxVersion;
        @(6) int binaryVersion;
        @(7) int cacheModeCa;
    }

}


================================================
FILE: source/dcompute/driver/cuda/memory.d
================================================
module dcompute.driver.cuda.memory;

import dcompute.driver.error;
import dcompute.driver.cuda;

// void pointer like
struct MemoryPointer
{
    size_t raw;
    static MemoryPointer allocate(size_t nbytes)
    {
        MemoryPointer ret;
        status = cast(Status)cuMemAlloc(&ret.raw,nbytes);
        checkErrors();
        return ret;
    }
    //static MemoryPointer allocatePitch(T)(size_t nbytes)

    Memory addressRange()
    {
        Memory ret;
        status = cast(Status)cuMemGetAddressRange(&ret.ptr.raw,&ret.length,raw);
        checkErrors();
        return ret;
    }

}

// void[] like
struct Memory
{
    MemoryPointer ptr;
    size_t length;

    enum CopySource
    {
        Host,
        Device,
        Array
    }

    // cuMemcpy and friends
    // TODO: implement this properly
    /*
    template copy(T, CopySource from, CopySource to, int dimentions = 1,
                  Flag!"peer" _peer = No.peer)
    {
        auto copy(Memory to)
        {
            status = cast(Status)cuMemcpy(to.ptr.raw,ptr.raw,length);
            checkErrors();
        }
    }*/

    // TODO: cuMemset & friends

}


================================================
FILE: source/dcompute/driver/cuda/package.d
================================================
module dcompute.driver.cuda;

public import ldc.dcompute;
public import bindbc.cuda;

public import dcompute.driver.error;

public import dcompute.driver.cuda.buffer;
public import dcompute.driver.cuda.context;
public import dcompute.driver.cuda.device;
public import dcompute.driver.cuda.event;
public import dcompute.driver.cuda.kernel;
public import dcompute.driver.cuda.memory;
public import dcompute.driver.cuda.platform;
public import dcompute.driver.cuda.program;
public import dcompute.driver.cuda.queue;
public import dcompute.driver.cuda.unified_buffer;

enum Copy
{
    hostToDevice,
    deviceToHost,
    array,
}

enum MemoryBankConfig : int
{
    default_,
    fourBytes,
    eightBytes,
}
template HostArgsOf(F) {
    import std.meta, std.traits;
    alias HostArgsOf = staticMap!(ReplaceTemplate!(Pointer, Buffer), Parameters!F);
}
private template ReplaceTemplate(alias needle, alias replacement) {
    template ReplaceTemplate(T) {
        static if (is(T : needle!Args, Args...)) {
            alias ReplaceTemplate = replacement!(Args[1]);
        } else {
            alias ReplaceTemplate = T;
        }
    }
}


================================================
FILE: source/dcompute/driver/cuda/platform.d
================================================
module dcompute.driver.cuda.platform;

import dcompute.driver.error;
import dcompute.driver.cuda;
import std.experimental.allocator.typed;

struct Platform
{
    static void initialise(uint flags =0)
    {
        auto support = loadCUDA();
        if (support == CUDASupport.noLibrary || support == CUDASupport.badLibrary)
        {
            status = Status.sharedObjectInitFailed;
            checkErrors();
        }
        status = cast(Status)cuInit(flags);
        checkErrors();
    }
    
    static Device[] getDevices(A)(A a)
    {
        int len;
        TypedAllocator!(A) allocator;
        status = cast(Status)cuDeviceGetCount(&len);
        checkErrors();

        //TODO:
        //Device[] ret = allocator.makeArray!(Device)(len);
        Device[] ret = new Device[len];
        foreach(int i; 0 .. len)
        {
            status = cast(Status)cuDeviceGet(&ret[i].raw,i);
            checkErrors();
        }
        return ret;
    }
    
}


================================================
FILE: source/dcompute/driver/cuda/program.d
================================================
module dcompute.driver.cuda.program;

import dcompute.driver.cuda;

import std.string;
struct Program
{
    CUmodule raw;
    
    Kernel!void getKernelByName(immutable(char)* name)
    {
        Kernel!void ret;
        status = cast(Status)cuModuleGetFunction(&ret.raw,this.raw,name);
        checkErrors();
        return ret;
    }
    Kernel!(typeof(k)) getKernel(alias k)()
    {
        return cast(typeof(return)) getKernelByName(k.mangleof.ptr);
    }
    // TODO: Support globals & images. Requires competent compiler. 
    //cuModuleGetGlobal
    //cuModuleGetTexRef
    //cuModuleGetSurfRef
    
    static Program fromFile(string name)
    {
        Program ret;
        status = cast(Status)cuModuleLoad(&ret.raw,name.toStringz);
        checkErrors();
        return ret;
    }

    static Program fromString(string name)
    {
        Program ret;
        status = cast(Status)cuModuleLoadData(&ret.raw,name.toStringz);
        checkErrors();
        return ret;
    }
    
    __gshared static Program globalProgram;
    //cuModuleLoadDataEx
    //cuModuleLoadFatBinary
    
    void unload()
    {
        status = cast(Status)cuModuleUnload(raw);
        checkErrors();
    }
    
    //TODO: linkstate
}





================================================
FILE: source/dcompute/driver/cuda/queue.d
================================================
// A stream in CUDA speak
module dcompute.driver.cuda.queue;

import dcompute.driver.cuda;
struct Queue
{
    CUstream raw;
    this (bool async)
    {
        status = cast(Status)cuStreamCreate(&raw, async ? 1 : 0);
        checkErrors();
    }
    this (bool async, int priority)
    {
        status = cast(Status)cuStreamCreateWithPriority(&raw, async ? 1 : 0, priority);
        checkErrors();
    }
    
    @property bool async()
    {
        uint ret;
        status = cast(Status)cuStreamGetFlags(raw,&ret);
        checkErrors();
        return cast(bool) ret;
    }
    
    @property int priority()
    {
        int ret;
        status = cast(Status)cuStreamGetPriority(raw,&ret);
        checkErrors();
        return ret;
    }

    void wait(Event e,uint flags)
    {
        status = cast(Status)cuStreamWaitEvent(raw,e.raw,flags);
        checkErrors();
    }
    
    // cuMemcpy.*Async and friends
    // TODO: implement this properly
    /*template copy(T, CopySource from, CopySource to, int dimentions = 1,
                  Flag!"peer" _peer = No.peer)
    {
        auto copy(Memory to)
        {
            status = cast(Status)cuMemcpy(to.ptr.raw,ptr.raw,length);
            checkErrors();
        }
    }*/

    
    /*void addCallback(void delegate(Queue,Status) dg)
    {
        static CUstreamCallback cb = (void* ,Status void*) =>
        cuStreamAddCallback
    }*/
    
    auto enqueue(alias k)(uint[3] _grid, uint[3] _block, uint _sharedMem = 0)
    {
        static struct Call
        {
            Queue q;
            uint[3] grid, block;
            uint sharedMem;
            
            this(Queue _q,uint[3] _grid, uint[3] _block, uint _sharedMem)
            {
                q= _q;
                grid = _grid;
                block = _block;
                sharedMem = _sharedMem;
            }
            //TODO: integrate events into this.
            void opCall(HostArgsOf!(typeof(k)) args)
            {
                auto kernel = Program.globalProgram.getKernel!k();
                void*[typeof(args).length] vargs;
                foreach(uint i, ref a; args)
                {
                    vargs[i] = cast(void*)&a;
                }
                
                status = cast(Status)
                        cuLaunchKernel(kernel.raw,
                                       grid[0], grid[1], grid[2],
                                       block[0],block[1],block[2],
                                       sharedMem,
                                       q.raw,
                                       vargs.ptr,
                                       null);
                checkErrors();
            }
        }
        
        return Call(this,_grid,_block,_sharedMem);
    }
}


================================================
FILE: source/dcompute/driver/cuda/unified_buffer.d
================================================
/**
 * Unified Memory (Managed Memory) buffer for CUDA.
 *
 * A UnifiedBuffer!T allocates memory that is accessible from both the host
 * (CPU) and the device (GPU) through a single pointer. The CUDA runtime
 * migrates data automatically, so explicit copy!(Copy.hostToDevice) /
 * copy!(Copy.deviceToHost) calls are not needed.
 *
 *
 * Requirements:
 *   - CUDA Compute Capability >= 3.0
 *   - Device.supportsUnifiedMemory == true
 */
module dcompute.driver.cuda.unified_buffer;

import dcompute.driver.cuda;

// Attach mode — controls which streams can access the managed allocation
// initially. CU_MEM_ATTACH_GLOBAL makes the buffer immediately visible to
// all streams (the most common choice). CU_MEM_ATTACH_HOST restricts it to
// the host until a stream explicitly attaches to it.

enum AttachMode : uint
{
    /// Accessible from any CUDA stream (default). Equivalent to
    /// CU_MEM_ATTACH_GLOBAL.
    global_ = CU_MEM_ATTACH_GLOBAL,

    /// Initially host-only. Use cuStreamAttachMemAsync (not yet wrapped) or
    /// switch to global_ to make the buffer available on the device.
    /// Equivalent to CU_MEM_ATTACH_HOST.
    host = CU_MEM_ATTACH_HOST,
}

struct UnifiedBuffer(T)
{
    /// Raw CUdeviceptr — also a valid host-side pointer on unified-memory
    /// capable systems (UVA must be enabled, which is true on all 64-bit CUDA
    /// systems with CC >= 2.0).
    size_t raw;

    private size_t _length; // number of T elements

    // ------------------------------------------------------------------
    // Construction
    // ------------------------------------------------------------------

    /**
     * Allocate `elems` uninitialised elements of T in managed memory.
     *
     * Params:
     *   elems = number of elements to allocate
     *   mode  = attachment scope (default: global_)
     */
    @trusted this(size_t elems, AttachMode mode = AttachMode.global_)
    {
        status = cast(Status)cuMemAllocManaged(&raw, elems * T.sizeof,
                                              cast(uint)mode);
        checkErrors();
        _length = elems;
    }

    /**
     * Allocate and initialise from a host slice.
     * The contents of `arr` are copied into the managed allocation before
     * returning, so the caller's original array is no longer needed.
     *
     * Params:
     *   arr  = source host data
     *   mode = attachment scope (default: global_)
     */
    this(T[] arr, AttachMode mode = AttachMode.global_)
    {
        this(arr.length, mode);
        hostSlice[] = arr[];
    }

    // ------------------------------------------------------------------
    // Host-side access
    // ------------------------------------------------------------------

    /**
     * Returns a D slice backed by the managed allocation.
     * Valid to read/write on the host at any time when no kernel is
     * concurrently accessing the same memory.
     */
    @property @trusted T[] hostSlice()
    {
        return (cast(T*)raw)[0 .. _length];
    }

    /// Number of elements.
    @property size_t length() const { return _length; }

 
    // Device-side hints

    /**
     * Prefetch this buffer's data to a device asynchronously.
     *
     * Initiates memory migration to the specified device prior to kernel execution
     * to avoid on-demand page migration latency.
     *
     * Note: Explicit prefetching requires CUDA 8.0 or higher. On older drivers
     * where `cuMemPrefetchAsync` is not available, this is a silent no-op —
     * unified memory still works correctly via demand paging.
     */
    @trusted void prefetch(Device dev, Queue q = Queue.init)
    {
        if (cuMemPrefetchAsync == null)
            return;

        status = cast(Status)cuMemPrefetchAsync(cast(CUdeviceptr)raw, _length * T.sizeof, dev.raw, q.raw);
        checkErrors();
    }


    /// Free the managed allocation.  After this call `raw` and `length`
    /// are zeroed; accessing `hostSlice` is undefined behaviour.
    @trusted void release()
    {
        status = cast(Status)cuMemFree(raw);
        checkErrors();
        raw = 0;
        _length = 0;
    }

    /// Satisfies the same hostArgOf alias contract as Buffer!T so that
    /// HostArgsOf!kernelFn replaces GlobalPointer!T with UnifiedBuffer!T
    /// transparently.
    alias hostArgOf(U : GlobalPointer!T) = raw;

    // Implicit conversion to Buffer!T

    /**
     * Returns a Buffer!T view of this managed allocation.
     *
     * This conversion exists so that UnifiedBuffer!T can be passed directly
     * to Queue.enqueue!() whose opCall signature is fixed at compile-time
     * to (Buffer!float, ...) via HostArgsOf.  Because both structs store
     * `raw` as their first field, CUDA receives the correct CUdeviceptr.
     *
     * The Buffer's hostMemory slice is set to hostSlice so that if anyone
     * accidentally calls copy!() on the returned Buffer, it still touches
     * the right memory region.
     */
    @property Buffer!T asBuffer()
    {
        Buffer!T b;
        b.raw        = raw;
        b.hostMemory = hostSlice;
        return b;
    }

    /// Implicit subtype: UnifiedBuffer!T is accepted wherever Buffer!T is
    /// expected (e.g. Queue.enqueue!() opCall arguments).
    alias this = asBuffer;
}


================================================
FILE: source/dcompute/driver/error.d
================================================
/**/

module dcompute.driver.error;

// Helpfully, OpenCL error codes are negative and CUDA's are positive
enum Status : int {
    Success = 0,
    // CUDA Errors.
    invalidValue                = 1,
    outOfMemory                 = 2,
    notInitialized              = 3,
    deinitialized               = 4,
    profilerDisabled            = 5,
    profilerNotInitialized      = 6,
    profilerAlreadyStarted      = 7,
    profilerAlradyStopped       = 8,
    noDevice                    = 100,
    invalidDevice               = 101,
    invalidImage                = 200,
    invalidContext              = 201,
    contextAlreadyCurrent       = 202,
    mapFailed                   = 205,
    unmapFailed                 = 206,
    arrayIsMapped               = 207,
    alreadyMapped               = 208,
    noBinaryForGPU              = 209,
    alreadyAcquired             = 210,
    notMapped                   = 211,
    notMappedAsArray            = 212,
    notMappedAsPointer          = 213,
    eccUncorrectable            = 214,
    unsupportedLimit            = 215,
    contextAlredyInUse          = 216,
    peerAccessUnsupported       = 217,
    invalidPtx                  = 218,
    invalidGraphicsContext      = 219,
    nvlinkUncorrectable         = 220,
    jitCompilerNotFound         = 221,
    invalidSource               = 300,
    fileNotFound                = 301,
    sharedObjectSymbolNotFound  = 302,
    sharedObjectInitFailed      = 303,
    operatingSystem             = 304,
    invalidHandle               = 400,
    illegalState                = 401,
    notFound                    = 500,
    notReady                    = 600,
    illegalAddress              = 700,
    launchOutOfResources        = 701,
    launchTimeout               = 702,
    launchIncompatibleTexturing = 703,
    peerAccessAlreadyEnabled    = 704,
    peerAccessNotEnabled        = 705,
    primaryContextActive        = 708,
    contextIsDestroyed          = 709,
    assertError                 = 710,
    tooManyPeers                = 711,
    hostMemoryAlreadyRegistered = 712,
    hostMemoryNotRegistered     = 713,
    hardwareStackError          = 714,
    illegalInstruction          = 715,
    misalignedAddress           = 716,
    invalidAddressSpace         = 717,
    invalidPC                   = 718,
    launchFailed                = 719,
    cooperativeLaunchTooLarge   = 720,
    notPermitted                = 800,
    notSupported                = 801,
    systemNotReady              = 802,
    systemDriverMismatch        = 803,
    compatNotSupportedOnDevice  = 804,
    streamCaptureUnsupported    = 900,
    streamCaptureInvalidated    = 901,
    streamCaptureMerge          = 902,
    streamCaptureUnmatched      = 903,
    streamCaptureUnjoined       = 904,
    streamCaptureIsolation      = 905,
    streamCaptureImplicit       = 906,
    capturedEvent               = 907,
    streamCaptureWrongThread    = 908,
    unknown                     = 999,

    // OpenCL Errors.
    deviceNotFound                 = -1,
    deviceNotAvailable             = -2,
    compilerNotAvailable           = -3,
    memoryObjectAloocationFailure  = -4,
    outOfResources                 = -5,
    outOfHostMemory                = -6,
    profilingInfomationAvailable   = -7,
    memoryCopyOverlap              = -8,
    imageFormatMismatch            = -9,
    imageFormatNotSupported        = -10,
    buildProgramFailed             = -11,
    mapFailure                     = -12,
    misalignedSubBufferOffset      = -13,
    errorsInWaitList               = -14,
    compileProgramFailure          = -15,
    linkerNotAvailable             = -16,
    linkerFailure                  = -17,
    devicePartitionFailure         = -18,
    kernelArgInfoNotAvailable      = -19,
    
    invalidValueCL                 = -30,
    invalidDeviceType              = -31,
    invalidPlatform                = -32,
    invalidDeviceCL                = -33,
    invalidContextCL               = -34,
    invalidQueueProperties         = -35,
    invalidQueue                   = -36,
    invalidHostPointerCL           = -37,
    invalidMemoryObject            = -38,
    invalidImageFormatDesctiptor   = -39,
    invalidImageSize               = -40,
    invalidSampler                 = -41,
    invalidBinary                  = -42,
    invalidBuildOptions            = -43,
    invalidProgram                 = -44,
    invalidExecutable              = -45,
    invalidKernelName              = -46,
    invalidKernelDefinition        = -47,
    invalidKernel                  = -48,
    invalidArgumentIndex           = -49,
    invalidArgumentValue           = -50,
    invalidArgumentSize            = -51,
    invalidKernelArguments         = -52,
    invalidWorkDimensions          = -53,
    invaildWorkGroupSize           = -54,
    invaildWorkItemSize            = -55,
    invalidGlobalOffest            = -56,
    invalidEventWaitList           = -57,
    invalidEvent                   = -58,
    invalidOperation               = -59,
    invalidGLObject                = -60,
    invalidBufferSize              = -61,
    invalidMipLevel                = -62,
    invalidGlobalWorkSize          = -63,
    invalidProperty                = -64,
    invalidImageDescriptor         = -65,
    invalidCompilerOptions         = -66,
    invalidLinkerOptions           = -67,
    invalidDevicePartitionCount    = -68,
    
    invalidGLSharegroupReference   = -1000,
    platformNotFound               = -1001,
    invalidD3D10Device             = -1002,
    invalidD3D10Resource           = -1003,
    D3D10ResouceAlreadyAcquired    = -1004,
    D3D10ResourceNotAcquires       = -1005,
    invalidD3D11Device             = -1006,
    invalidD3D11Resource           = -1007,
    D3D11ResourceAlredyAcquired    = -1008,
    D3D11ResourceNotAcquired       = -1009,
    invalidDX9MediaAdapter         = -1010,
    invalidDX9MediaSurface         = -1011,
    DX9MediaSurfaceAlreadyAcquired = -1012,
    DX9MediaSurfaceNotAcquired     = -1013,
    
    devicePartitionFailed          = -1057,
    invalidPartitionCount          = -1058,
    invalidPartitionName           = -1059,
    
    invalidEGLObject               = -1093,
    EGLResourceNotAcquired         = -1092,
}

version (D_BetterC)
{
    void delegate (Status) nothrow @nogc onDriverError = (Status _status) 
    { 
        defaultOnDriverError(_status);
    };
    
    immutable void delegate (Status) nothrow @nogc defaultOnDriverError = 
    (Status _status)
    {
        import core.stdc.stdio : fprintf, stderr;
        // std.conv.to needs the GC and exceptions, neither of which is
        // available in betterC, so print the numeric status code instead.
        fprintf(stderr,"*** DCompute driver error: %d\n", cast(int)_status);
    };
}
else
{
    class DComputeDriverException : Exception
    {
        this(string msg, string file = __FILE__,
             size_t line = __LINE__, Throwable next = null)
        {
            super(msg, file, line, next);
        }
        
        this(Status err, string file = __FILE__, 
             size_t line = __LINE__, Throwable next = null)
        {
            import std.conv : to;
            super(err.to!string, file, line, next);
        }
    }
    void delegate(Status) onDriverError = (Status _status) 
    {
        defaultOnDriverError(_status);
    };
    immutable void delegate(Status) defaultOnDriverError =
    (Status _status)
    {
        throw new DComputeDriverException(_status);
    };
}

// Thread local status
Status status;

version(DComputeIgnoreDriverErrors)
{
    void checkErrors() {}
}
else
{
    void checkErrors()
    {
        if (status) onDriverError(status);
    }

}


================================================
FILE: source/dcompute/driver/ocl/buffer.d
================================================
module dcompute.driver.ocl.buffer;

import dcompute.driver.ocl;

struct Buffer(T)
{
    cl_mem raw;

    // Host memory associated with this buffer
    T[] hostMemory;
    enum CreateType
    {
        region = 0x1220,
    }
    // opSlice clCreateSubBuffer
}


================================================
FILE: source/dcompute/driver/ocl/context.d
================================================
module dcompute.driver.ocl.context;

import dcompute.driver.ocl;
import std.typecons;

import std.experimental.allocator.typed;

struct Context
{
    cl_context raw;
    
    enum Properties
    {
        platform        = 0x1084,
        interopUserSync = 0x1085,
    }
    
    static struct Info
    {
        @(0x1080) uint referenceCount;
        @(0x1081) Device* _devices;
        @(0x1082) Context.Properties* properties;
        @(0x1083) uint numDevices;
        ArrayAccesssor!(_devices,numDevices) devices;
        // Extensions
        //@(0x2010) khrTerminate;
        //@(0x200E) khrMemoryInitialise;
        //@(0x4014) CONTEXT_D3D10_DEVICE_KHR
        //@(0x402C) CONTEXT_D3D10_PREFER_SHARED_RESOURCES_KHR
        //@(0x401D) CONTEXT_D3D11_DEVICE_KHR
        //@(0x402D) CONTEXT_D3D11_PREFER_SHARED_RESOURCES_KHR
        //@(0x2025) CONTEXT_ADAPTER_D3D9_KHR
        //@(0x2026) CONTEXT_ADAPTER_D3D9EX_KHR
        //@(0x2027) CONTEXT_ADAPTER_DXVA_KHR
        //@(0x2008) GL_CONTEXT_KHR
        //@(0x2009) EGL_DISPLAY_KHR
        //@(0x200A) GLX_DISPLAY_KHR
        //@(0x200B) WGL_HDC_KHR
        //@(0x200C) CGL_SHAREGROUP_KHR

    }
    //mixin(generateGetInfo!(Info,clGetContextInfo));
    
    this(Device[] devs,const Properties[] props)
    {
        raw = clCreateContext(cast(const cl_context_properties*)props.ptr,
                              cast(uint)devs.length,cast(const cl_device_id*)devs.ptr,
                              null,null,
                              cast(int*)&status);
        checkErrors();
    }
    
    this(Device.Type type,const Properties[] props)
    {
        raw = clCreateContextFromType(cast(const cl_context_properties*)props.ptr,
                                      cast(cl_device_type)type,
                                      null,null,
                                      cast(int*)&status);
        checkErrors();
    }
    void retain()
    {
        status = cast(Status)clRetainContext(raw);
        checkErrors();
    }
    
    void release()
    {
        status = cast(Status)clReleaseContext(raw);
        checkErrors();
    }
    
    Queue createQueue(Device dev,Queue.Properties prop)
    {
        Queue ret;
        ret.raw = clCreateCommandQueue(this.raw,
                                       dev.raw,
                                       cast(cl_command_queue_properties)prop,
                                       cast(int*)&status);
        checkErrors();
        return ret;
    }
    
    Buffer!T createBuffer(T)(T[] arr,Memory.Flags flags = (Memory.Flags.useHostPointer | Memory.Flags.readWrite))
    {
        Buffer!T ret;
        auto len = memSize(arr);
        ret.raw = clCreateBuffer(raw,flags,len,arr.ptr,cast(int*)&status);
        ret.hostMemory = arr;
        checkErrors();
        return ret;
    }
    
    /*Image.Format[] supportedImageFormats(A)(A allocator, Memory.Flags f,Memory.Type t)
    {
        //Double call
        clGetSupportedImageFormats
    }*/
    
    Sampler createSampler(Flag!"normalisedCoordinates" f,
                          Sampler.AddressingMode aMode,
                          Sampler.FilterMode fMode)
    {
        Sampler ret;
        ret.raw = clCreateSampler(this.raw,
                                  cast(cl_bool)f,
                                  cast(cl_addressing_mode)aMode,
                                  cast(cl_filter_mode)fMode,
                                  cast(int*)&status);
        checkErrors();
        return ret;
    }
    
    /**Program createProgramFromSource(string[][] sources)
     {
        clCreateProgramWithSource
     }
    */
    
    Program createProgramFromSPIR(A)(A a, Device[] devices,ubyte[] spir)
    {
        auto allocator = TypedAllocator!(A)(a);
        auto lengths = allocator.makeArray!(size_t)(devices.length);
        lengths[]    = spir.length;
        auto ptrs  = allocator.makeArray!(ubyte*)(devices.length);
        ptrs[]       = spir.ptr;
        Program ret;

        ret.raw = clCreateProgramWithBinary(
                                this.raw,
                                cast(uint)devices.length, cast(cl_device_id*)devices.ptr,
                                lengths.ptr,ptrs.ptr,
                                null, // TODO report individual errors
                                cast(int*)&status);
        allocator.dispose(lengths);
        allocator.dispose(ptrs);
        checkErrors();
        return ret;
    }
    Program createProgram(void[] spirv)
    {
        Program ret;

        ret.raw = clCreateProgramWithIL(this.raw,
                                        spirv.ptr,
                                        spirv.length,
                                        cast(int*)&status);
        checkErrors();
        return ret;
    }
    
    /*Program createProgramFromBuiltinKernels(Device[] devices, string kernelNames)
    {
        clCreateProgramWithBuiltInKernels
    }*/
}


================================================
FILE: source/dcompute/driver/ocl/device.d
================================================
module dcompute.driver.ocl.device;

import derelict.opencl.cl;
import dcompute.driver.ocl;
import std.meta: AliasSeq;

struct Device
{
    enum Type : cl_bitfield
    {
        default_    = 0x1,
        CPU         = 0x2,
        GPU         = 0x4,
        accelerator = 0x8,
        custom      = 0x10,
        all         = 0xFFFFFFFF
    }
    
    enum AffinityDomain : cl_bitfield
    {
        numa        = 0x1,
        l4_Cache    = 0x2,
        l3_Cache    = 0x4,
        l2_Cache    = 0x8,
        l1_Cache    = 0x10,
        nextPartitionable = 0x20
    }
    
    enum PartitionProperty : long
    {
        Equally          = 0x1086,
        ByCounts         = 0x1087,
        ByCountsListEnd  = 0,
        ByAffinityDomain = 0x1088,
    }
    
    enum FPConfig : cl_bitfield
    {
        denorm                  = 1 << 0,
        infNan                  = 1 << 1,
        roundNearest            = 1 << 2,
        roundZero               = 1 << 3,
        roundInf                = 1 << 4,
        fma                     = 1 << 5,
        softFloat               = 1 << 6,
        correctlyRoundedDivSqrt = 1 << 7,
    }
    
    enum MemoryCacheType : cl_uint
    {
        none = 0,
        readOnly = 1,
        readWrite = 2,
    }
    
    enum LocalMemoryType : cl_uint
    {
        local  = 0x1, // CL_LOCAL
        global = 0x2, // CL_GLOBAL
    }
    
    enum ExecutionCapabilities : cl_bitfield
    {
        kernel       = 1 << 0, // CL_EXEC_KERNEL
        nativeKernel = 1 << 1, // CL_EXEC_NATIVE_KERNEL
    }
    
    static struct Info
    {
        @(0x1000) Type type;
        @(0x1001) uint vendorID;
        @(0x1002) uint maxComputeUnits;
        @(0x1003) uint _maxWorkItemDimensions;
        @(0x1004) size_t maxWorkGroupSize;
        @(0x1005) size_t* _maxWorkItemSizes;
        ArrayAccesssor!(_maxWorkItemSizes,_maxWorkItemDimensions) maxWorkItems;
        @(0x1006) uint preferredVectorWidthByte;
        @(0x1007) uint preferredVectorWidthShort;
        @(0x1008) uint preferredVectorWidthInt;
        @(0x1009) uint preferredVectorWidthLong;
        @(0x100A) uint preferredVectorWidthFloat;
        @(0x100B) uint preferredVectorWidthDouble;
        @(0x100C) uint maxClockFrequency;
        @(0x100D) uint addressBits;
        @(0x100E) uint maxReadImageArgs;
        @(0x100F) uint maxWriteImageArgs;
        @(0x1010) ulong maxMemoryAllocSize;
        @(0x1011) size_t image2DMaxWidth;
        @(0x1012) size_t image2DMaxHeight;
        @(0x1013) size_t image3DMaxWidth;
        @(0x1014) size_t image3DMaxHeight;
        @(0x1015) size_t image3DMaxDepth;
        @(0x1016) bool imageSupport;
        @(0x1017) size_t maxParameterSize;
        @(0x1018) uint maxSamplers;
        @(0x1019) uint memoryBaseAddressAlign;
        @(0x101A) uint minDataTypeAlignSize;        // Deprecated in OpenCL 1.2
        @(0x101B) FPConfig floatFPConfig;
        @(0x101C) MemoryCacheType globalMemoryCacheType;
        @(0x101D) uint  globalMemoryCachelineSize;
        @(0x101E) ulong globalMemoryCacheSize;
        @(0x101F) ulong globalMemorySize;
        @(0x1020) ulong maxConstantBufferSize;
        @(0x1021) uint  maxConstantArgs;
        @(0x1022) LocalMemoryType localMemoryType;
        @(0x1023) ulong localMemorySize;
        @(0x1024) bool errorCorrectionSupport;
        @(0x1025) size_t profilingTimerResolution;
        @(0x1026) bool endianLittle;
        @(0x1027) bool available;
        @(0x1028) bool compilerAvailable;
        @(0x1029) ExecutionCapabilities executionCapabilities;
        @(0x102A) Queue.Properties queueProperties;
        @(0x102B) char* _name;
        @(0x102C) char* _vendor;
        @(0x102D) char* _driverVersion;
        @(0x102E) char* _profile;
        @(0x102F) char* _deviceVersion;
        @(0x1030) char* _extensions;
        
        StringzAccessor!(_name) name;
        StringzAccessor!(_vendor) vendor;
        StringzAccessor!(_driverVersion) driverVersion;
        StringzAccessor!(_profile) profile;
        StringzAccessor!(_deviceVersion) deviceVersion;
        StringzAccessor!(_extensions) extensions;
        
        @(0x1031) Platform platform;
        @(0x1032) FPConfig doubleFPConfig;
        @(0x1033) FPConfig halfFPConfig;
        @(0x1034) uint preferredVectorWidthHalf;
        @(0x1035) bool hostUnifiedMemory;
        @(0x1036) uint nativeVectorWidthByte;
        @(0x1037) uint nativeVectorWidthShort;
        @(0x1038) uint nativeVectorWidthInt;
        @(0x1039) uint nativeVectorWidthLong;
        @(0x103A) uint nativeVectorWidthFloat;
        @(0x103B) uint nativeVectorWidthDouble;
        @(0x103C) uint nativeVectorWidthHalf;
        @(0x103D) char* _OpenCLCVersion;
        StringzAccessor!(_OpenCLCVersion) OpenCLCVersion;
        @(0x103E) bool linkerAvailable;
        @(0x103F) char* _builtinKernels;
        StringzAccessor!(_builtinKernels) builtinKernels;
        @(0x1040) size_t imageMaxBufferSize;
        @(0x1041) size_t imageMaxArraySize;
        @(0x1042) Device parentDevice;
        @(0x1043) uint partitionMaxSubDevices;
        //@(0x1044) PartitionProperty* _partitionProperties;
        //ZeroTerminatedArrayAccessor!(_partitionProperties) partitionProperties;
        @(0x1045) AffinityDomain partitionAffinityDomain;
        //@(0x1046) PartitionProperty* _partitionType;
        //ZeroTerminatedArrayAccessor!(_partitionType) partitionType;
        @(0x1047) uint referenceCount;
        @(0x1048) bool preferredInteropUserSync;
        @(0x1049) size_t printfBufferSize;
        
        // Extensions
        //@(0x200F) khrTerminateCapability;
        //@(0x4000) nvComputeCapabilityMajor;
        //@(0x4001) nvComputeCapabilityMinor;
        //@(0x4002) nvRegistersPerBlock;
        //@(0x4003) nvWarpSize;
        //@(0x4004) nvGPUOverlap;
        //@(0x4005) nvKernelExecTimeout;
        //@(0x4006) nvIntegratedMemory;
        
        //@(0x4036) amdProfilingTimerOffset
    }
    
    cl_device_id raw;

    mixin(generateGetInfo!(Info,clGetDeviceInfo));
    
    //Is this a double call function? Also what to do about properties:
    //the list is zero terminated and can contain numbers.
    //See http://www.khronos.org/registry/cl/sdk/1.2/docs/man/xhtml/clCreateSubDevices.html under the examples
    /*Device[] createSubDevices(cl.device_partition_property[] properties,
                                cl.uint_ numSubDevices)
    {
        
    }
    */

    void retain()
    {
        status = cast(Status)clRetainDevice(raw);
        checkErrors();
    }
    
    void release()
    {
        status = cast(Status)clReleaseDevice(raw);
        checkErrors();
    }
    
}


================================================
FILE: source/dcompute/driver/ocl/event.d
================================================
module dcompute.driver.ocl.event;

import dcompute.driver.ocl;

struct Event
{
    cl_event raw;
    enum EnqueuedCommand
    {
        kernel            = 0x11F0,
        task              = 0x11F1,
        nativeKernel      = 0x11F2,
        bufferRead        = 0x11F3,
        bufferWrite       = 0x11F4,
        bufferCopy        = 0x11F5,
        imageRead         = 0x11F6,
        imageWrite        = 0x11F7,
        imageCopy         = 0x11F8,
        imageToBufferCopy = 0x11F9,
        bufferToImageCopy = 0x11FA,
        bufferMap         = 0x11FB,
        imageMap          = 0x11FC,
        unmap             = 0x11FD,
        marker            = 0x11FE,
        acquireGLObjects  = 0x11FF,
        releaseGLObjects  = 0x1200,
        bufferRectRead    = 0x1201,
        bufferRectWrite   = 0x1202,
        bufferRectCopy    = 0x1203,
        user              = 0x1204,
        barrier           = 0x1205,
        migrate           = 0x1206,
        bufferFill        = 0x1207,
        imageFill         = 0x1208,
        
        // Extensions
        acquireD3D10Objects = 0x4017,
        releaseD3D10Objects = 0x4018,
        acquireDX9MediaSurfaces = 0x202B,
        releaseDX9MediaSurfaces = 0x202C,
        acquireD3D11Objects = 0x4020,
        releaseD3D11Objects = 0x4021,
        GLFenceSyncObject   = 0x200D,
        EGLFenceSyncObject  = 0x202F,
        acquireEGLObjects   = 0x202D,
        releaseEGLObjects   = 0x202E,

    }
    
    
    enum ExecutionStatus
    {
        complete  = 0x0,
        running   = 0x1,
        submitted = 0x2,
        queued    = 0x3,
    }
    static struct Info
    {
        @(0x11D0) Queue queue;
        @(0x11D1) EnqueuedCommand type;
        @(0x11D2) uint referenceCount;
        @(0x11D3) ExecutionStatus status;
        @(0x11D4) Context context;
    }
    //mixin(generateGetInfo!(Info,clGetEventInfo));
    
    void retain()
    {
        status = cast(Status)clRetainEvent(raw);
        checkErrors();
    }
    
    void release()
    {
        status = cast(Status)clReleaseEvent(raw);
        checkErrors();
    }
    void wait()
    {
        status = cast(Status)clWaitForEvents(1,&raw);
        checkErrors();
    }
}

void wait(Event[] e)
{
    status = cast(Status)clWaitForEvents(cast(uint)e.length,cast(cl_event*)e.ptr);
    checkErrors();
}


================================================
FILE: source/dcompute/driver/ocl/image.d
================================================
module dcompute.driver.ocl.image;

import dcompute.driver.ocl;
struct Image
{
    cl_mem raw;
    
    enum ChannelOrder
    {
        r            = 0x10B0,
        a            = 0x10B1,
        rg           = 0x10B2,
        ra           = 0x10B3,
        rgb          = 0x10B4,
        rgba         = 0x10B5,
        bgra         = 0x10B6,
        argb         = 0x10B7,
        intensity    = 0x10B8,
        luminance    = 0x10B9,
        Rx           = 0x10BA,
        RGx          = 0x10BB,
        RGBx         = 0x10BC,
        depth        = 0x10BD,
        depthStencil = 0x10BE,
    }
    
    enum  ChannelType
    {
        snormInt8      = 0x10D0,
        snormInt16     = 0x10D1,
        unormInt8      = 0x10D2,
        unormInt16     = 0x10D3,
        unormShort565  = 0x10D4,
        unormShort555  = 0x10D5,
        unormInt101010 = 0x10D6,
        byte_          = 0x10D7,
        short_         = 0x10D8,
        int_           = 0x10D9,
        ubyte_         = 0x10DA,
        ushort_        = 0x10DB,
        uint_          = 0x10DC,
        half_          = 0x10DD,
        float_         = 0x10DE,
        unormInt24     = 0x10DF,
    }
    static struct Format
    {
        ChannelOrder order;
        ChannelType  type;
    }
    static struct Info
    {
        @(0x1110) Format format;
        @(0x1111) size_t elementSize;
        @(0x1112) size_t rowPitch;
        @(0x1113) size_t slicePitch;
        @(0x1114) size_t width;
        @(0x1115) size_t height;
        @(0x1116) size_t depth;
        @(0x1117) size_t arraySize;
        @(0x1118) Memory memory;
        @(0x1119) uint mipLevels;
        @(0x111A) uint samples;
        
        // Extensions
        //@(0x4016) D3D10_SUBRESOURCE_KHR
        //@(0x401F) D3D11_SUBRESOURCE_KHR
        //@(0x202A) DX9_MEDIA_PLANE_KHR
    }
    //mixin(generateGetInfo!(Info,clGetImageInfo));
}


================================================
FILE: source/dcompute/driver/ocl/kernel.d
================================================
module dcompute.driver.ocl.kernel;

import dcompute.driver.ocl;

struct Kernel(F) if (is(F == function) || is(F==void))
{
    cl_kernel raw;
    
    static struct Info
    {
        @(0x1190) immutable char* _name;
        StringzAccessor!(_name) name;
        @(0x1191) uint numArgs;
        @(0x1192) uint referenceCount;
        @(0x1193) Context context;
        @(0x1194) Program program;
        @(0x1195) immutable char* _attributes;
        StringzAccessor!(_attributes) attributes;
    }
    //mixin(generateGetInfo!(Info,clGetKernelInfo));
    void retain()
    {
        status = cast(Status)clRetainKernel(raw);
        checkErrors();
    }
    
    void release()
    {
        status = cast(Status)clReleaseKernel(raw);
        checkErrors();
    }
    
    void setArg(T)(uint index, T val, const bool isPrivate = false)
    {
        static if (__traits(hasMember, T, "raw")) {
            status = cast(Status)clSetKernelArg(this.raw, index, cl_mem.sizeof, (isPrivate ? null : &val.raw));
        } else {
            status = cast(Status)clSetKernelArg(this.raw, index, T.sizeof, (isPrivate ? null : &val));
        }
        checkErrors();
    }
}

struct Arg
{
    Kernel!void kernel;
    uint argIndex;
    enum AddressQualifier
    {
        global   = 0x119B,
        local    = 0x119C,
        constant = 0x119D,
        private_ = 0x119E,
    }
    
    enum AccessQualifier
    {
        readOnly  = 0x11A0,
        writeOnly = 0x11A1,
        readWrite = 0x11A2,
        none      = 0x11A3,
    }
    
    enum TypeQualifier
    {
        none     = 0,
        const_   = 1 << 0,
        restrict = 1 << 1,
        volatile = 1 << 2,
    }
    
    static struct Info
    {
        @(0x1196) AddressQualifier addressQualifier;
        @(0x1197) AccessQualifier accessQualifier;
        @(0x1198) immutable char* _typeName;
        StringzAccessor!(_typeName) typeName;
        @(0x1199) TypeQualifier typeQualifier;
        @(0x119A) immutable char* _name;
        StringzAccessor!(_name) name;
    }
    
    //mixin(generateGetInfo!(Info,clGetKernelArgInfo,"kernel.raw,argIndex"));
}

struct WorkGroup
{
    Kernel!void kernel;
    Device device;
    static struct Info
    {
        @(0x11B0) size_t workGroupSize;
        @(0x11B1) size_t[3] requiredWorkGroupSize;
        @(0x11B2) ulong localMemorySize;
        @(0x11B3) size_t preferredWorkGroupSizeMultiple;
        @(0x11B4) ulong privateMemSize;
        @(0x11B5) size_t[3] globalWorkSize;
    }
    
    //mixin(generateGetInfo!(Info,clGetKernelWorkGroupInfo,"kernel.raw,device.raw"));
}


================================================
FILE: source/dcompute/driver/ocl/memory.d
================================================
module dcompute.driver.ocl.memory;

import dcompute.driver.ocl;

struct Memory
{
    enum Type
    {
        buffer         = 0x10F0,
        image2D        = 0x10F1,
        image3D        = 0x10F2,
        image2Darray   = 0x10F3,
        image1D        = 0x10F4,
        image1Darray   = 0x10F5,
        image1Dbuffer  = 0x10F6,
    }
    
    enum Flags
    {
        none                = 0,
        readWrite           = 1 << 0,
        writeOnly           = 1 << 1,
        readOnly            = 1 << 2,
        useHostPointer      = 1 << 3,
        allocateHostPointer = 1 << 4,
        copyHostPointer     = 1 << 5,
        //reserved            1 << 6,
        hostWriteOnly       = 1 << 7, // CL_MEM_HOST_WRITE_ONLY
        hostReadOnly        = 1 << 8,
        hostNoAccess        = 1 << 9,
    }
    
    static struct Info
    {
        @(0x1100) Type type;
        @(0x1101) Flags flags;
        @(0x1102) size_t size;
        @(0x1103) void* hostPointer;
        @(0x1104) uint mapCount;
        @(0x1105) uint referenceCount;
        @(0x1106) Context context;
        @(0x1107) Memory associatedMemory;
        @(0x1108) size_t offset;
        
        // Extensions
        //@(0x4015) D3D10_RESOURCE_KHR
        //@(0x401E) D3D11_RESOURCE_KHR
        //@(0x2028) DX9_MEDIA_ADAPTER_TYPE_KHR
        //@(0x2029) DX9_MEDIA_SURFACE_INFO_KHR
    }
    cl_mem raw;
    
    //mixin(generateGetInfo!(Info,clGetMemObjectInfo));
    void retain()
    {
        status = cast(Status)clRetainMemObject(raw);
        checkErrors();
    }
    void release()
    {
        status = cast(Status)clReleaseMemObject(raw);
        checkErrors();
    }
}


================================================
FILE: source/dcompute/driver/ocl/package.d
================================================
module dcompute.driver.ocl;

public import dcompute.driver.error;

public import dcompute.driver.ocl.buffer;
public import dcompute.driver.ocl.context;
public import dcompute.driver.ocl.device;
public import dcompute.driver.ocl.event;
public import dcompute.driver.ocl.image;
public import dcompute.driver.ocl.kernel;
public import dcompute.driver.ocl.memory;
public import dcompute.driver.ocl.platform;
public import dcompute.driver.ocl.program;
public import dcompute.driver.ocl.queue;
public import dcompute.driver.ocl.raw;
public import dcompute.driver.ocl.sampler;
public import dcompute.driver.ocl.util;


================================================
FILE: source/dcompute/driver/ocl/platform.d
================================================
module dcompute.driver.ocl.platform;

import dcompute.driver.ocl;
import std.experimental.allocator.typed;
import std.meta: AliasSeq;

struct Platform
{
    static void initialise()
    {
        DerelictCL.load();
    }
    static struct Info
    {
        @(0x0900) immutable(char)* _profile;
        @(0x0901) immutable(char)* _version_;
        @(0x0902) immutable(char)* _name;
        @(0x0903) immutable(char)* _vendor;
        @(0x0904) immutable(char)* _extensions;
        StringzAccessor!(_profile) profile;
        StringzAccessor!(_version_) version_;
        StringzAccessor!(_name) name;
        StringzAccessor!(_vendor) vendor;
        StringzAccessor!(_extensions) extensions;
        // Extensions
        //@(0x0920) khrICDSuffix;

    }

    mixin(generateGetInfo!(Info,clGetPlatformInfo));

    cl_platform_id raw;
    static Platform[] getPlatforms(A)(A a)
    {
        auto allocator = TypedAllocator!(A)(a);
        cl_uint numPlatforms;
        status = cast(Status)clGetPlatformIDs(0,null,&numPlatforms);
        checkErrors();
        cl_platform_id[] ret = allocator.makeArray!(cl_platform_id)(numPlatforms);
        status = cast(Status)clGetPlatformIDs(numPlatforms,cast(cl_platform_id*)ret.ptr,null);
        checkErrors();
        return cast(Platform[])ret;
    }
    
    Device[] getDevices(A)(A a,Device.Type device_type = Device.Type.all)
    {
        auto allocator = TypedAllocator!(A)(a);
        uint numDevices;
        // First call: query how many devices match device_type.
        status = cast(Status)clGetDeviceIDs(
            raw,
            cast(cl_device_type)device_type,
            0,
            null,
            &numDevices);
        checkErrors();
        
        auto deviceIDs = allocator.makeArray!cl_device_id(numDevices);
        
        // Second call: fill the allocated array with the device IDs.
        status = cast(Status)clGetDeviceIDs(
            raw,
            cast(cl_device_type)device_type,
            numDevices,
            deviceIDs.ptr,
            null);
        checkErrors();
        
        return cast(Device[])deviceIDs;
    }
    
    // clGetExtensionFunctionAddressForPlatform
}


================================================
FILE: source/dcompute/driver/ocl/program.d
================================================
module dcompute.driver.ocl.program;

import dcompute.driver.ocl;
import std.meta: AliasSeq;
import std.string : toStringz;

struct Program
{
    static struct Info
    {
        @(0x1160) uint referenceCount;
        @(0x1161) Context context;
        
        @(0x1162) uint _numDevices;
        @(0x1163) Device* _devices;
        ArrayAccesssor!(_devices,_numDevices) devices;
        
        @(0x1164) char* _source;
        StringzAccessor!(_source) source;
        
        @(0x1165) size_t* _binarySizes;
        @(0x1166) ubyte** _binaries;
        @(0x1167) size_t  _numKernels;
        ArrayAccesssor2D!(_binaries,_binarySizes,_numKernels) binaries;
        
        @(0x1168) char* _kernelNames;
        StringzAccessor!(_kernelNames) kernelNames;
    }
    static Program globalProgram;
    cl_program raw;
    mixin(generateGetInfo!(Info,clGetProgramInfo));
    void retain()
    {
        status = cast(Status)clRetainProgram(raw);
        checkErrors();
    }
    
    void release()
    {
        status = cast(Status)clReleaseProgram(raw);
        checkErrors();
    }
    void build(Device[] devices, string options)
    {
        status = cast(Status)clBuildProgram(raw,
                                cast(uint)devices.length,cast(cl_device_id*)devices.ptr,
                                options.toStringz,
                                null,null);
        checkErrors();
    }
    
    Kernel!(typeof(sym)) getKernel(alias sym)()
    {
        Kernel!void ret = getKernel(sym.mangleof);
        return cast(typeof(return))ret;
    }
    Kernel!void getKernel(string name)
    {
        Kernel!void ret;
        ret.raw = clCreateKernel(this.raw,name.toStringz,cast(int*)&status);
        checkErrors();
        return ret;
    }
    
}



struct Build
{
    Program program;
    Device  device;
    enum  BinaryType
    {
        none       = 0x0,
        object     = 0x1,
        library    = 0x2,
        executable = 0x4,
    }
    
    enum Status
    {
        success    =  0,
        none       = -1,
        error      = -2,
        inProgress = -3,
    }
    
    static struct Info
    {
        @(0x1181) Status status;
        @(0x1182) char* _options;
        StringzAccessor!(_options) options;
        @(0x1183) char* _log;
        StringzAccessor!(_log) log;
        @(0x1184) BinaryType binaryType;
    }
    mixin(generateGetInfo!(Info,clGetProgramBuildInfo,"program.raw,device.raw"));
}


================================================
FILE: source/dcompute/driver/ocl/queue.d
================================================
module dcompute.driver.ocl.queue;

import dcompute.driver.ocl;
import dcompute.driver.util;
import std.typecons;

enum MapBufferFlags
{
    read                  = 1 << 0,
    write                 = 1 << 1,
    writeInvalidateRegion = 1 << 2,
}

enum  MemoryMigrationFlags
{
    host             = 1 << 0,
    contentUndefined = 1 << 1,
}

struct Queue
{
    cl_command_queue raw;
    // constructed from context
    
    enum Properties : cl_bitfield
    {
        outOfOrderExecution = 1 << 0,
        profiling = 1 << 1
    }
    static struct Info
    {
        @(0x1090) Context context;
        @(0x1091) Device device;
        @(0x1092) uint referenceCount;
        @(0x1093) Properties properties;
    }
    
    //mixin(generateGetInfo!(Info,clGetCommandQueueInfo));
    
    void retain()
    {
        status = cast(Status)clRetainCommandQueue(raw);
        checkErrors();
    }
    
    void release()
    {
        status = cast(Status)clReleaseCommandQueue(raw);
        checkErrors();
    }
    
    Event write(T)(Buffer!T buffer, T[] data,
                   Flag!"Blocking" blocking = Yes.Blocking,
                   size_t offset = 0, const Event[] waitList = null)
    {
        Event ret;
        status = cast(Status)clEnqueueWriteBuffer(this.raw, buffer.raw, cast(cl_bool)blocking, offset,
                                      data.memSize, cast(void*)data.ptr,
                                      cast(cl_uint)waitList.length, cast(cl_event*)waitList.ptr,
                                      &ret.raw);
        checkErrors();
        return ret;
                    
    }
    
    Event read(T)(Buffer!T buffer, T[] data,
                  Flag!"Blocking" blocking = Yes.Blocking,
                  size_t offset = 0, const Event[] waitList = null)
    {
        Event ret;
        status = cast(Status)clEnqueueReadBuffer(this.raw, buffer.raw, cast(cl_bool)blocking, offset,
                                     data.memSize, cast(void*)data.ptr,
                                     cast(cl_uint)waitList.length, cast(cl_event*)waitList.ptr,
                                     &ret.raw);
        checkErrors();
        return ret;
    }
    
    auto enqueue(alias k)(const size_t[] globalWorkSize,
                        const size_t[] globalWorkOffset = null, const size_t[] localWorkSize = null,
                        const Event[] waitList = null)
    in
    {
        if(globalWorkOffset)
            assert(globalWorkSize.length == globalWorkOffset.length);
        if(localWorkSize)
            assert(globalWorkSize.length == localWorkSize.length);
    }
    do
    {
        static struct Call
        {
            Queue q;
            const size_t[] globalWorkSize, globalWorkOffset,localWorkSize;
            const Event[] waitList;
            this(Queue _q,const size_t[] a, const size_t[] b, const size_t[] c, const Event[] d)
            {
                q = _q;
                globalWorkSize = a;
                globalWorkOffset = b;
                localWorkSize = c;
                waitList = d;
            }
            Event opCall(HostArgsOf!(typeof(k)) args)
            {
                auto kernel = Program.globalProgram.getKernel!k();
                foreach(uint i, a; args)
                {
                    kernel.setArg(i,a);
                }
                Event e;
                status = cast(Status)clEnqueueNDRangeKernel(q.raw, kernel.raw,
                                       cast(uint)globalWorkSize.length,
                                       globalWorkOffset.ptr, globalWorkSize.ptr, localWorkSize.ptr,
                                       cast(uint)waitList.length, cast(cl_event*)waitList.ptr,
                                       &e.raw);
                checkErrors();
                kernel.release();
                return e;
            }
        }
        
        return Call(this,globalWorkSize,globalWorkOffset,localWorkSize,waitList);
    }
    
    Queue flush()
    {
        status = cast(Status)clFlush(this.raw);
        checkErrors();
        return this;
    }

    Queue finish()
    {
        status = cast(Status)clFinish(this.raw);
        checkErrors();
        return this;
    }
    //TODO: fill, copy, marker, barrier [, rectFill, rect Copy]

    
}


================================================
FILE: source/dcompute/driver/ocl/raw/enums.d
================================================
module dcompute.driver.ocl.raw.enums;

import dcompute.driver.ocl;

enum //: profiling_info
{
    PROFILING_COMMAND_QUEUED = 0x1280,
    PROFILING_COMMAND_SUBMIT = 0x1281,
    PROFILING_COMMAND_START  = 0x1282,
    PROFILING_COMMAND_END    = 0x1283,
}

// device_partition_property_ext extension
enum
{
    DEVICE_PARTITION_EQUALLY_EXT             = 0x4050,
    DEVICE_PARTITION_BY_COUNTS_EXT           = 0x4051,
    DEVICE_PARTITION_BY_NAMES_EXT            = 0x4052,
    DEVICE_PARTITION_BY_AFFINITY_DOMAIN_EXT  = 0x4053,
}

// clGetDeviceInfo selectors
enum
{
    DEVICE_PARENT_DEVICE_EXT                 = 0x4054,
    DEVICE_PARTITION_TYPES_EXT               = 0x4055,
    DEVICE_AFFINITY_DOMAINS_EXT              = 0x4056,
    DEVICE_REFERENCE_COUNT_EXT               = 0x4057,
    DEVICE_PARTITION_STYLE_EXT               = 0x4058,
}

// AFFINITY_DOMAINs
enum
{
    AFFINITY_DOMAIN_L1_CACHE_EXT             = 0x1,
    AFFINITY_DOMAIN_L2_CACHE_EXT             = 0x2,
    AFFINITY_DOMAIN_L3_CACHE_EXT             = 0x3,
    AFFINITY_DOMAIN_L4_CACHE_EXT             = 0x4,
    AFFINITY_DOMAIN_NUMA_EXT                 = 0x10,
    AFFINITY_DOMAIN_NEXT_FISSIONABLE_EXT     = 0x100,
}

// device_partition_property_ext list terminators
enum
{
    PROPERTIES_LIST_END_EXT          =  0,
    PARTITION_BY_COUNTS_LIST_END_EXT =  0,
    PARTITION_BY_NAMES_LIST_END_EXT  =  0 - 1,
}


// egl.h

// gl.h

// gl_object_type
enum
{
    GL_OBJECT_BUFFER                         = 0x2000,
    GL_OBJECT_TEXTURE2D                      = 0x2001,
    GL_OBJECT_TEXTURE3D                      = 0x2002,
    GL_OBJECT_RENDERBUFFER                   = 0x2003,
    GL_OBJECT_TEXTURE2D_ARRAY                = 0x200E,
    GL_OBJECT_TEXTURE1D                      = 0x200F,
    GL_OBJECT_TEXTURE1D_ARRAY                = 0x2010,
    GL_OBJECT_TEXTURE_BUFFER                 = 0x2011,
}

// gl_texture_info
enum
{
    GL_TEXTURE_TARGET                        = 0x2004,
    GL_MIPMAP_LEVEL                          = 0x2005,
    GL_NUM_SAMPLES                           = 0x2012,
}

// gl_context_info
enum
{
    CURRENT_DEVICE_FOR_GL_CONTEXT_KHR        = 0x2006,
    DEVICES_FOR_GL_CONTEXT_KHR               = 0x2007,
}


// d3d10_device_source_khr
enum
{
    D3D10_DEVICE_KHR                             = 0x4010,
    D3D10_DXGI_ADAPTER_KHR                       = 0x4011,
}

// d3d10_device_set_khr
enum
{
    PREFERRED_DEVICES_FOR_D3D10_KHR              = 0x4012,
    ALL_DEVICES_FOR_D3D10_KHR                    = 0x4013,
}

// d3d11_device_source
enum
{
    D3D11_DEVICE_KHR                             = 0x4019,
    D3D11_DXGI_ADAPTER_KHR                       = 0x401A,
}

// d3d11_device_set
enum
{
    PREFERRED_DEVICES_FOR_D3D11_KHR              = 0x401B,
    ALL_DEVICES_FOR_D3D11_KHR                    = 0x401C,
}

// media_adapter_type_khr
enum
{
    ADAPTER_D3D9_KHR                             = 0x2020,
    ADAPTER_D3D9EX_KHR                           = 0x2021,
    ADAPTER_DXVA_KHR                             = 0x2022,
}

// media_adapter_set_khr
enum
{
    PREFERRED_DEVICES_FOR_DX9_MEDIA_ADAPTER_KHR  = 0x2023,
    ALL_DEVICES_FOR_DX9_MEDIA_ADAPTER_KHR        = 0x2024,
}



================================================
FILE: source/dcompute/driver/ocl/raw/functions.d
================================================
module dcompute.driver.ocl.raw.functions;

//This is an autogenerated file, do not edit


import dcompute.driver.ocl;
//nothrow: @nogc:

/*
auto getEventProfilingInfo(event a, profiling_info b, size_t c, void* d, size_t* e)
{
    debug assert(clGetEventProfilingInfo);
    auto ret = cast(int)clGetEventProfilingInfo(cast(cl_event)a, cast(cl_profiling_info)b, cast(size_t)c, cast(void*)d, cast(size_t*)e);
    return ret;
}

auto enqueueCopyBuffer(command_queue a, mem b, mem c, size_t d, size_t e, size_t f, uint g, const(event*) h, event* i)
{
    debug assert(clEnqueueCopyBuffer);
    auto ret = cast(int)clEnqueueCopyBuffer(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_mem)c, cast(size_t)d, cast(size_t)e, cast(size_t)f, cast(cl_uint)g, cast(const(cl_event*))h, cast(cl_event*)i);
    return ret;
}

auto enqueueReadImage(command_queue a, mem b, bool c, const(size_t*) d, const(size_t*) e, size_t f, size_t g, void* h, uint i, const(event*) j, event* k)
{
    debug assert(clEnqueueReadImage);
    auto ret = cast(int)clEnqueueReadImage(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_bool)c, cast(const(size_t*))d, cast(const(size_t*))e, cast(size_t)f, cast(size_t)g, cast(void*)h, cast(cl_uint)i, cast(const(cl_event*))j, cast(cl_event*)k);
    return ret;
}

auto enqueueWriteImage(command_queue a, mem b, bool c, const(size_t*) d, const(size_t*) e, size_t f, size_t g, const(void*) h, uint i, const(event*) j, event* k)
{
    debug assert(clEnqueueWriteImage);
    auto ret = cast(int)clEnqueueWriteImage(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_bool)c, cast(const(size_t*))d, cast(const(size_t*))e, cast(size_t)f, cast(size_t)g, cast(const(void*))h, cast(cl_uint)i, cast(const(cl_event*))j, cast(cl_event*)k);
    return ret;
}

auto enqueueCopyImage(command_queue a, mem b, mem c, const(size_t*) d, const(size_t*) e, const(size_t*) f, uint g, const(event*) h, event* i)
{
    debug assert(clEnqueueCopyImage);
    auto ret = cast(int)clEnqueueCopyImage(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_mem)c, cast(const(size_t*))d, cast(const(size_t*))e, cast(const(size_t*))f, cast(cl_uint)g, cast(const(cl_event*))h, cast(cl_event*)i);
    return ret;
}

auto enqueueCopyImageToBuffer(command_queue a, mem b, mem c, const(size_t*) d, const(size_t*) e, size_t f, uint g, const(event*) h, event* i)
{
    debug assert(clEnqueueCopyImageToBuffer);
    auto ret = cast(int)clEnqueueCopyImageToBuffer(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_mem)c, cast(const(size_t*))d, cast(const(size_t*))e, cast(size_t)f, cast(cl_uint)g, cast(const(cl_event*))h, cast(cl_event*)i);
    return ret;
}

auto enqueueCopyBufferToImage(command_queue a, mem b, mem c, size_t d, const(size_t*) e, const(size_t*) f, uint g, const(event*) h, event* i)
{
    debug assert(clEnqueueCopyBufferToImage);
    auto ret = cast(int)clEnqueueCopyBufferToImage(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_mem)c, cast(size_t)d, cast(const(size_t*))e, cast(const(size_t*))f, cast(cl_uint)g, cast(const(cl_event*))h, cast(cl_event*)i);
    return ret;
}

auto enqueueMapBuffer(command_queue a, mem b, bool c, map_flags d, size_t e, size_t f, uint g, const(event*) h, event* i, int* j)
{
    debug assert(clEnqueueMapBuffer);
    auto ret = cast(void*)clEnqueueMapBuffer(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_bool)c, cast(cl_map_flags)d, cast(size_t)e, cast(size_t)f, cast(cl_uint)g, cast(const(cl_event*))h, cast(cl_event*)i, cast(cl_int*)j);
    return ret;
}

auto enqueueMapImage(command_queue a, mem b, bool c, map_flags d, const(size_t*) e, const(size_t*) f, size_t* g, size_t* h, uint i, const(event*) j, event* k, int* l)
{
    debug assert(clEnqueueMapImage);
    auto ret = cast(void*)clEnqueueMapImage(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_bool)c, cast(cl_map_flags)d, cast(const(size_t*))e, cast(const(size_t*))f, cast(size_t*)g, cast(size_t*)h, cast(cl_uint)i, cast(const(cl_event*))j, cast(cl_event*)k, cast(cl_int*)l);
    return ret;
}

auto enqueueUnmapMemObject(command_queue a, mem b, void* c, uint d, const(event*) e, event* f)
{
    debug assert(clEnqueueUnmapMemObject);
    auto ret = cast(int)clEnqueueUnmapMemObject(cast(cl_command_queue)a, cast(cl_mem)b, cast(void*)c, cast(cl_uint)d, cast(const(cl_event*))e, cast(cl_event*)f);
    return ret;
}

auto enqueueNDRangeKernel(command_queue a, kernel b, uint c, const(size_t*) d, const(size_t*) e, const(size_t*) f, uint g, const(event*) h, event* i)
{
    debug assert(clEnqueueNDRangeKernel);
    auto ret = cast(int)clEnqueueNDRangeKernel(cast(cl_command_queue)a, cast(cl_kernel)b, cast(cl_uint)c, cast(const(size_t*))d, cast(const(size_t*))e, cast(const(size_t*))f, cast(cl_uint)g, cast(const(cl_event*))h, cast(cl_event*)i);
    return ret;
}

auto enqueueTask(command_queue a, kernel b, uint c, const(event*) d, event* e)
{
    debug assert(clEnqueueTask);
    auto ret = cast(int)clEnqueueTask(cast(cl_command_queue)a, cast(cl_kernel)b, cast(cl_uint)c, cast(const(cl_event*))d, cast(cl_event*)e);
    return ret;
}

extern(System) alias enqueueNativeKernel_FuncAlias = void function(void*);
auto enqueueNativeKernel(command_queue a, enqueueNativeKernel_FuncAlias b, void* c, size_t d, uint e, const(mem*) f, const(void*)* g, uint h, const(event*) i, event* j)
{
    debug assert(clEnqueueNativeKernel);
    auto ret = cast(int)clEnqueueNativeKernel(cast(cl_command_queue)a, cast(enqueueNativeKernel_FuncAlias)b, cast(void*)c, cast(size_t)d, cast(cl_uint)e, cast(const(cl_mem*))f, cast(const(void*)*)g, cast(cl_uint)h, cast(const(cl_event*))i, cast(cl_event*)j);
    return ret;
}

auto setCommandQueueProperty(command_queue a, command_queue_properties b, bool c, command_queue_properties* d)
{
    debug assert(clSetCommandQueueProperty);
    auto ret = cast(int)clSetCommandQueueProperty(cast(cl_command_queue)a, cast(cl_command_queue_properties)b, cast(cl_bool)c, cast(cl_command_queue_properties*)d);
    return ret;
}

auto createSubBuffer(mem a, mem_flags b, buffer_create_type c, const(void*) d, int* e)
{
    debug assert(clCreateSubBuffer);
    auto ret = cast(mem)clCreateSubBuffer(cast(cl_mem)a, cast(cl_mem_flags)b, cast(cl_buffer_create_type)c, cast(const(void*))d, cast(cl_int*)e);
    return ret;
}

extern(System) alias setMemObjectDestructorCallback_FuncAlias = void function(cl_mem, void*);
auto setMemObjectDestructorCallback(mem a, setMemObjectDestructorCallback_FuncAlias b, void* c)
{
    debug assert(clSetMemObjectDestructorCallback);
    auto ret = cast(int)clSetMemObjectDestructorCallback(cast(cl_mem)a, cast(setMemObjectDestructorCallback_FuncAlias)b, cast(void*)c);
    return ret;
}

auto createUserEvent(context a, int* b)
{
    debug assert(clCreateUserEvent);
    auto ret = cast(event)clCreateUserEvent(cast(cl_context)a, cast(cl_int*)b);
    return ret;
}

auto setUserEventStatus(event a, int b)
{
    debug assert(clSetUserEventStatus);
    auto ret = cast(int)clSetUserEventStatus(cast(cl_event)a, cast(cl_int)b);
    return ret;
}

extern(System) alias setEventCallback_FuncAlias = void function(cl_event, cl_int, void*);
auto setEventCallback(event a, int b, setEventCallback_FuncAlias c, void* d)
{
    debug assert(clSetEventCallback);
    auto ret = cast(int)clSetEventCallback(cast(cl_event)a, cast(cl_int)b, cast(setEventCallback_FuncAlias)c, cast(void*)d);
    return ret;
}

auto enqueueReadBufferRect(command_queue a, mem b, bool c, const(size_t*) d, const(size_t*) e, const(size_t*) f, size_t g, size_t h, size_t i, size_t j, void* k, uint l, const(event*) m, event* n)
{
    debug assert(clEnqueueReadBufferRect);
    auto ret = cast(int)clEnqueueReadBufferRect(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_bool)c, cast(const(size_t*))d, cast(const(size_t*))e, cast(const(size_t*))f, cast(size_t)g, cast(size_t)h, cast(size_t)i, cast(size_t)j, cast(void*)k, cast(cl_uint)l, cast(const(cl_event*))m, cast(cl_event*)n);
    return ret;
}

auto enqueueWriteBufferRect(command_queue a, mem b, bool c, const(size_t*) d, const(size_t*) e, const(size_t*) f, size_t g, size_t h, size_t i, size_t j, const(void*) k, uint l, const(event*) m, event* n)
{
    debug assert(clEnqueueWriteBufferRect);
    auto ret = cast(int)clEnqueueWriteBufferRect(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_bool)c, cast(const(size_t*))d, cast(const(size_t*))e, cast(const(size_t*))f, cast(size_t)g, cast(size_t)h, cast(size_t)i, cast(size_t)j, cast(const(void*))k, cast(cl_uint)l, cast(const(cl_event*))m, cast(cl_event*)n);
    return ret;
}

auto enqueueCopyBufferRect(command_queue a, mem b, mem c, const(size_t*) d, const(size_t*) e, const(size_t*) f, size_t g, size_t h, size_t i, size_t j, uint k, const(event*) l, event* m)
{
    debug assert(clEnqueueCopyBufferRect);
    auto ret = cast(int)clEnqueueCopyBufferRect(cast(cl_command_queue)a, cast(cl_mem)b, cast(cl_mem)c, cast(const(size_t*))d, cast(const(size_t*))e, cast(const(size_t*))f, cast(size_t)g, cast(size_t)h, cast(size_t)i, cast(size_t)j, cast(cl_uint)k, cast(const(cl_event*))l, cast(cl_event*)m);
    return ret;
}

auto createImage2D(context a, mem_flags b, const(image_format*) c, size_t d, size_t e, size_t f, void* g, int* h)
{
    debug assert(clCreateImage2D);
    auto ret = cast(mem)clCreateImage2D(cast(cl_context)a, cast(cl_mem_flags)b, cast(const(cl_image_format*))c, cast(size_t)d, cast(size_t)e, cast(size_t)f, cast(void*)g, cast(cl_int*)h);
    return ret;
}

auto createImage3D(context a, mem_flags b, const(image_format*) c, size_t d, size_t e, size_t f, size_t g, size_t h, void* i, int* j)
{
    debug assert(clCreateImage3D);
    auto ret = cast(mem)clCreateImage3D(cast(cl_context)a, cast(cl_mem_flags)b, cast(const(cl_image_format*))c, cast(size_t)d, cast(size_t)e, cast(size_t)f, cast(size_t)g, cast(size_t)h, cast(void*)i, cast(cl_int*)j);
    return ret;
}

auto enqueueMarker(command_queue a, event* b)
{
    debug assert(clEnqueueMarker);
    auto ret = cast(int)clEnqueueMarker(cast(cl_command_queue)a, cast(cl_event*)b);
    return ret;
}

auto enqueueWaitForEvents(command_queue a, uint b, const(event*) c)
{
    debug assert(clEnqueueWaitForEvents);
    auto ret = cast(int)clEnqueueWaitForEvents(cast(cl_command_queue)a, cast(cl_uint)b, cast(const(cl_event*))c);
    return ret;
}

auto enqueueBarrier(command_queue a)
{
    debug assert(clEnqueueBarrier);
    auto ret = cast(int)clEnqueueBarrier(cast(cl_command_queue)a);
    return ret;
}

auto unloadCompiler()
{
    debug assert(clUnloadCompiler);
    auto ret = cast(int)clUnloadCompiler();
    return ret;
}

auto getExtensionFunctionAddress(const(char*) a)
{
    debug assert(clGetExtensionFunctionAddress);
    auto ret = cast(void*)clGetExtensionFunctionAddress(cast(const(char*))a);
    return ret;
}

auto createSubDevices(device_id a, const(device_partition_property*) b, uint c, device_id* d, uint* e)
{
    debug assert(clCreateSubDevices);
    auto ret = cast(int)clCreateSubDevices(cast(cl_device_id)a, cast(const(cl_device_partition_property*))b, cast(cl_uint)c, cast(cl_device_id*)d, cast(cl_uint*)e);
    return ret;
}

auto retainDevice(device_id a)
{
    debug assert(clRetainDevice);
    auto ret = cast(int)clRetainDevice(cast(cl_device_id)a);
    return ret;
}

auto releaseDevice(device_id a)
{
    debug assert(clReleaseDevice);
    auto ret = cast(int)clReleaseDevice(cast(cl_device_id)a);
    return ret;
}

auto createImage(context a, mem_flags b, const(image_format*) c, const(image_desc*) d, void* e, int* f)
{
    debug assert(clCreateImage);
    auto ret = cast(mem)clCreateImage(cast(cl_context)a, cast(cl_mem_flags)b, cast(const(cl_image_format*))c, cast(const(cl_image_desc*))d, cast(void*)e, cast(cl_int*)f);
    return ret;
}

extern(System) alias compileProgram_FuncAlias = void function(cl_program, void*);
auto compileProgram(program a, uint b, const(device_id*) c, const(char*) d, uint e, const(program*) f, const(char*)* g, compileProgram_FuncAlias h, void* i)
{
    debug assert(clCompileProgram);
    auto ret = cast(int)clCompileProgram(cast(cl_program)a, cast(cl_uint)b, cast(const(cl_device_id*))c, cast(const(char*))d, cast(cl_uint)e, cast(const(cl_program*))f, cast(const(char*)*)g, cast(compileProgram_FuncAlias)h, cast(void*)i);
    return ret;
}

extern(System) alias linkProgram_FuncAlias = void function(cl_program, void*);
auto linkProgram(context a, uint b, const(device_id*) c, const(char*) d, uint e, const(program*) f, linkProgram_FuncAlias g, void* h, int* i)
{
    debug assert(clLinkProgram);
    auto ret = cast(program)clLinkProgram(cast(cl_context)a, cast(cl_uint)b, cast(const(cl_device_id*))c, cast(const(char*))d, cast(cl_uint)e, cast(const(cl_program*))f, cast(linkProgram_FuncAlias)g, cast(void*)h, cast(cl_int*)i);
    return ret;
}

auto unloadPlatformCompiler(platform_id a)
{
    debug assert(clUnloadPlatformCompiler);
    auto ret = cast(int)clUnloadPlatformCompiler(cast(cl_platform_id)a);
    return ret;
}

auto enqueueFillBuffer(command_queue a, mem b, const(void*) c, size_t d, size_t e, size_t f, uint g, const(event*) h, event* i)
{
    debug assert(clEnqueueFillBuffer);
    auto ret = cast(int)clEnqueueFillBuffer(cast(cl_command_queue)a, cast(cl_mem)b, cast(const(void*))c, cast(size_t)d, cast(size_t)e, cast(size_t)f, cast(cl_uint)g, cast(const(cl_event*))h, cast(cl_event*)i);
    return ret;
}

auto enqueueFillImage(command_queue a, mem b, const(void*) c, const(size_t*) d, const(size_t*) e, uint f, const(event*) g, event* h)
{
    debug assert(clEnqueueFillImage);
    auto ret = cast(int)clEnqueueFillImage(cast(cl_command_queue)a, cast(cl_mem)b, cast(const(void*))c, cast(const(size_t*))d, cast(const(size_t*))e, cast(cl_uint)f, cast(const(cl_event*))g, cast(cl_event*)h);
    return ret;
}

auto enqueueMigrateMemObjects(command_queue a, uint b, const(mem*) c, mem_migration_flags d, uint e, const(event*) f, event* g)
{
    debug assert(clEnqueueMigrateMemObjects);
    auto ret = cast(int)clEnqueueMigrateMemObjects(cast(cl_command_queue)a, cast(cl_uint)b, cast(const(cl_mem*))c, cast(cl_mem_migration_flags)d, cast(cl_uint)e, cast(const(cl_event*))f, cast(cl_event*)g);
    return ret;
}

auto enqueueMarkerWithWaitList(command_queue a, uint b, const(event*) c, event* d)
{
    debug assert(clEnqueueMarkerWithWaitList);
    auto ret = cast(int)clEnqueueMarkerWithWaitList(cast(cl_command_queue)a, cast(cl_uint)b, cast(const(cl_event*))c, cast(cl_event*)d);
    return ret;
}

auto enqueueBarrierWithWaitList(command_queue a, uint b, const(event*) c, event* d)
{
    debug assert(clEnqueueBarrierWithWaitList);
    auto ret = cast(int)clEnqueueBarrierWithWaitList(cast(cl_command_queue)a, cast(cl_uint)b, cast(const(cl_event*))c, cast(cl_event*)d);
    return ret;
}

auto getExtensionFunctionAddressForPlatform(platform_id a, const(char*) b)
{
    debug assert(clGetExtensionFunctionAddressForPlatform);
    auto ret = cast(void*)clGetExtensionFunctionAddressForPlatform(cast(cl_platform_id)a, cast(const(char*))b);
    return ret;
}
*/


================================================
FILE: source/dcompute/driver/ocl/raw/package.d
================================================
module dcompute.driver.ocl.raw;

public import dcompute.driver.ocl.raw.functions;
public import dcompute.driver.ocl.raw.enums;
public import derelict.opencl.cl;


================================================
FILE: source/dcompute/driver/ocl/sampler.d
================================================
module dcompute.driver.ocl.sampler;

import dcompute.driver.ocl;
struct Sampler
{
    enum FilterMode
    {
        nearest = 0x1140,
        linear  = 0x1141,
    }
    
    enum AddressingMode
    {
        none           = 0x1130,
        clampToEdge    = 0x1131,
        clamp          = 0x1132,
        repeat         = 0x1133,
        mirroredRepeat = 0x1134,
    }
    static struct Info
    {
        @(0x1150) uint referenceCount;
        @(0x1151) Context context;
        @(0x1152) bool normalisedCoordinates; // CHECKME: OpenCL returns a cl_bool (a 32-bit cl_uint) here, whereas D's bool is 1 byte
        @(0x1153) AddressingMode addressingMode;
        @(0x1154) FilterMode filterMode;
    }

    cl_sampler raw;
    
    //mixin(generateGetInfo!(Info,clGetSamplerInfo));
    void retain()
    {
        status = cast(Status)clRetainSampler(raw);
        checkErrors();
    }
    
    void release()
    {
        status = cast(Status)clReleaseSampler(raw);
        checkErrors();
    }
    
}


================================================
FILE: source/dcompute/driver/ocl/util.d
================================================
module dcompute.driver.ocl.util;

import std.range;
import std.meta;
import std.traits;

//deal with arrays separately, in part because ElementType of a
//narrow string is dchar, which would give the wrong element size
@property auto memSize(R)(R r)
if (is(R : T[], T))
{
    static if (is(R : T[], T))
        return r.length * T.sizeof;
    else
        static assert(false);
}

@property auto memSize(R)(R r)
if(isInputRange!R && hasLength!R && !is(R : T[], T))
{
    return r.length * (ElementType!R).sizeof;
}

T[Args.length + 1] propertyList(T,Args...)(Args args)
{
    T[Args.length + 1] props;
    foreach(i, arg; args)
        props[i] = *cast(T*)(&arg);
    props[$-1] = cast(T)0;
    return props;
}

struct ArrayAccesssor(alias ptr, alias len) {}

struct StringzAccessor(alias ptr) {}

struct ZeroTerminatedArrayAccessor(alias ptr) {}

struct ArrayAccesssor2D(alias ptr, alias lens, alias len) {}

// Returned by ArrayAccesssor2D
struct RangeOfArray(T)
{
    T**     ptr;
    size_t* lengths;
    size_t  length;
    size_t  index;

    bool empty()
    {
        return index == length;
    }

    @property T[] front()
    {
        return ptr[index][0 .. lengths[index]];
    }

    T[] opIndex(size_t i)
    {
        return ptr[i][0 .. lengths[i]];
    }
    void popFront()
    {
        ++index;
    }
    
    @property size_t opDollar() { return length; }
}

string generateGetInfo(Info,alias func,string args = "raw")()
{
    import std.string;
    return helper!(Info.tupleof).format(func.stringof,args);
}

// A substitute for fullyQualifiedName to speed up compile time
private template isModule(alias a) {
    static if (is(a) || is(typeof(a)) || a.stringof.length < 7) {
        enum isModule = false;
    } else {
        enum isModule = a.stringof[0..7] == "module ";
    }
}

private template partiallyQualifiedName(alias a) {
    static if (isModule!a) {
        enum partiallyQualifiedName = "";
    } else {
        static if (!isModule!(__traits(parent, a))) {
            enum prefix = partiallyQualifiedName!(__traits(parent, a)) ~ ".";
        } else {
            enum prefix = "";
        }
        enum partiallyQualifiedName = prefix ~ __traits(identifier, a);
    }
}

private template helper(Fields...)
{
    static if (Fields.length == 0)
        enum helper = "";

    else static if (is(typeof(Fields[0]) : ArrayAccesssor!(ptr,len),alias ptr,alias len))
    {
        enum helper = "@property " ~ typeof(*ptr).stringof ~ "[] " ~ Fields[0].stringof ~ "()\n" ~
            "{\n" ~
            "    return " ~ ptr.stringof ~ "[0 .. " ~ len.stringof ~"];"~
            "}\n" ~ helper!(Fields[1 .. $]);
    }
    else static if (is(typeof(Fields[0]) : StringzAccessor!ptr,alias ptr))
    {
        enum helper = "@property char[] " ~ Fields[0].stringof ~ "()\n" ~
            "{\n" ~
            "    import std.typecons; char[] ret;" ~
            "    size_t len;" ~
            "    %1$s(%2$s," ~ __traits(getAttributes, ptr).stringof ~ "[0], 0, null, &len);" ~
            "    ret.length = len;" ~
            "    %1$s(%2$s," ~ __traits(getAttributes, ptr).stringof ~ "[0], memSize(ret), ret.ptr, null);" ~
            "    return ret;" ~
            "}\n" ~ helper!(Fields[1 .. $]);
    }
    else static if (is(typeof(Fields[0]) : ArrayAccesssor2D!(ptr,lens,len) , alias ptr, alias lens, alias len))
    {
        enum helper = "@property RangeOfArray!(" ~ typeof(**ptr).stringof ~ ") " ~ Fields[0].stringof ~ "()\n" ~
            "{\n" ~
            "   import std.typecons; size_t length; size_t* lengths; " ~ typeof(ptr).stringof ~ " ptr;" ~
            "   %1$s(%2$s," ~ __traits(getAttributes, len).stringof ~ "[0],length.sizeof, &length,null);" ~
            "   lengths = (new size_t[length]).ptr; ptr = (new " ~ typeof(*ptr).stringof ~ "[length]).ptr;" ~
            "   %1$s(%2$s," ~ __traits(getAttributes, lens).stringof ~ "[0],length * size_t.sizeof, lengths,null);" ~
            "   if (lengths[length - 1] == 0) length--;" ~
            "   foreach(i; 0 .. length) \n{" ~
            "       ptr[i] = (new " ~ typeof(**ptr).stringof ~ "[lengths[i]]).ptr;" ~
            "   }\n" ~
            "   %1$s(%2$s," ~ __traits(getAttributes, ptr).stringof ~ "[0], ptr.sizeof, ptr, null);" ~
            "   return typeof(return)(ptr,lengths,length,0);" ~
            "}\n" ~ helper!(Fields[1 .. $]);
    }
    else
    {
        static if (is(typeof(Fields[0]) == enum))
        {
            enum helper = "@property " ~ partiallyQualifiedName!(typeof(Fields[0])) ~ " " ~ Fields[0].stringof ~ "()\n" ~
                "{\n" ~
                "    import std.typecons; typeof(return) ret;" ~
                "%1$s(%2$s,"~ __traits(getAttributes, Fields[0]).stringof ~ "[0], ret.sizeof, &ret, null);" ~
                "return ret; " ~ 
                "}\n" ~ helper!(Fields[1 .. $]);
    
        }
        else 
        {
            enum helper = "@property " ~ typeof(Fields[0]).stringof ~ " " ~ Fields[0].stringof ~ "()\n" ~
                "{\n" ~
                "    import std.typecons; typeof(return) ret;" ~
                "%1$s(%2$s,"~ __traits(getAttributes, Fields[0]).stringof ~ "[0], ret.sizeof, &ret, null);" ~
                "return ret; " ~ 
                "}\n" ~ helper!(Fields[1 .. $]);
        }
    }
}


================================================
FILE: source/dcompute/driver/util.d
================================================
module dcompute.driver.util;

import std.traits;
import std.meta;
import ldc.dcompute : Pointer;
import dcompute.driver.ocl.buffer : Buffer;
template HostArgsOf(F)
{
    import std.traits;
    // TODO substitute Pointer!(n,T) with Buffer!T, Image etc.
    template toBuffer(T)
    {
        static if(is(T : Pointer!(n,U), uint n,U))
            alias toBuffer = Buffer!U;
        else
            alias toBuffer = T;
    }
    alias HostArgsOf = staticMap!(toBuffer,Parameters!F);
}


================================================
FILE: source/dcompute/kernels/README.md
================================================
Algorithms
==========

Adjacent

Example use
===========

Ideally we want to be able to do something like
```D
with(kernelLaunchConfig(...)) //includes the Queue to launch on and any other info
    T val = hostrange.array
            .transfer // transfer to device. parameters in config
            .exclusive_scan!add
            .inner_product(someOtherDeviceArray)
            .mapReduce(map_func,reduce_op)
            .retrieve;
```
and

```D
with(kernelLaunchConfig(...)) //includes the Queue, the device allocator to launch on, and any other info
    Event e = hostrange.array
                .transfer // transfer to device. parameters in config
                .exclusive_scan!add
                .inner_product(someOtherDeviceArray);
ErrorCode ec = e.waitAndYield(); //play nice with fibres/threads.
```
and have the pipeline be async and return an event/error.
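
As a point of reference for the pipeline sketched above, the following is a minimal host-only illustration of what the scan and inner-product stages are expected to compute. It is not part of dcompute and runs no device code. The names `exclusiveScan` and `innerProduct` are hypothetical helpers introduced only for this sketch, and the semantics assumed are the conventional ones: an exclusive scan places at each index the combination of all strictly earlier elements.

```D
// Hypothetical host-side reference; not part of dcompute.
// result[i] combines input[0 .. i]; result[0] is the identity element.
T[] exclusiveScan(alias op, T)(const(T)[] input, T identity)
{
    auto result = new T[input.length];
    T acc = identity;
    foreach (i, x; input)
    {
        result[i] = acc;   // everything strictly before index i
        acc = op(acc, x);  // fold the current element in for the next slot
    }
    return result;
}

// Hypothetical host-side inner product of two equally long arrays.
T innerProduct(T)(const(T)[] a, const(T)[] b)
{
    assert(a.length == b.length, "inputs must have equal length");
    T acc = 0;
    foreach (i; 0 .. a.length)
        acc += a[i] * b[i];
    return acc;
}

void main()
{
    int[] host = [1, 2, 3, 4];
    auto scanned = exclusiveScan!((a, b) => a + b)(host, 0);
    assert(scanned == [0, 1, 3, 6]);
    assert(innerProduct(scanned, [1, 1, 1, 1]) == 10);
}
```

A device implementation would presumably produce the same values while running asynchronously and reporting completion through an event, as described above.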


================================================
FILE: source/dcompute/kernels/package.d
================================================
module dcompute.kernels;
/*Adjacent:
 * adjacent!(R,alias e)(R r, R o) where e is a binary op to apply to adjacent elements of R
 *Allocator:
 *Search:
 * upper_bound
 * lower_bound
 * equal
 */

================================================
FILE: source/dcompute/std/atomic.d
================================================
@compute(CompileFor.deviceOnly) module dcompute.std.atomic;

import ldc.dcompute;

import cuda = dcompute.std.cuda.atomic;
public import dcompute.std.atomic_common : MemoryOrder;

int atomicAddShared(MemoryOrder mo = MemoryOrder.seq_cst)(SharedPointer!int dst, int val)
{
	if(__dcompute_reflect(ReflectTarget.CUDA))
		return cuda.atomicAddShared!mo(dst, val);
	assert(0);
}

int atomicAdd(MemoryOrder mo = MemoryOrder.seq_cst)(GlobalPointer!int dst, int val)
{
	if(__dcompute_reflect(ReflectTarget.CUDA))
		return cuda.atomicAdd!mo(dst, val);
	assert(0);
}
/*
 * @brief Atomically exchanges the value at the address with a new value.
 * @param dst The global memory address (passed as i64).
 * @param newVal The integer value to store (i32).
 * @return The old value that was stored at the address (i32).
 */
int atomicExchange(MemoryOrder mo = MemoryOrder.seq_cst)
                  (GlobalPointer!int dst, int newVal)
{
	if (__dcompute_reflect(ReflectTarget.CUDA))
		return cuda.atomicExchange!mo(dst, newVal);
	assert(0);
}

int atomicExchangeShared(MemoryOrder mo = MemoryOrder.seq_cst)(SharedPointer!int dst, int newVal)
{
	if(__dcompute_reflect(ReflectTarget.CUDA))
		return cuda.atomicExchangeShared!mo(dst, newVal);
	assert(0);
}

/*
 *Atomic:
 * T add (GenericPointer!T addr,T val)
 * T sub (GenericPointer!T addr,T val)
 * T xchg(GenericPointer!T addr,T val)
 * T min (GenericPointer!T addr,T val)
 * T max (GenericPointer!T addr,T val)
 * T cas (GenericPointer!T addr,T cmp,T val)
 * I and (GenericPointer!I addr,I val)
 * I or  (GenericPointer!I addr,I val)
 * I xor (GenericPointer!I addr,I val)
 * I inc (GenericPointer!I addr,I val)
 * I dec (GenericPointer!I addr,I val)

 */


================================================
FILE: source/dcompute/std/atomic_common.d
================================================
@compute(CompileFor.deviceOnly) module dcompute.std.atomic_common;

import ldc.dcompute;

enum MemoryOrder {
	relaxed, 
	acquire, 
	release, 
	acq_rel, 
	seq_cst
}


================================================
FILE: source/dcompute/std/cuda/atomic.d
================================================
@compute(CompileFor.deviceOnly) module dcompute.std.cuda.atomic;

import ldc.dcompute;
import dcompute.std.atomic_common : MemoryOrder;

pragma(LDC_inline_ir)
    R inlineIR(string s, R, P...)(P);

int atomicAddShared(MemoryOrder mo = MemoryOrder.seq_cst)(SharedPointer!int dst, int val)
{
	static if (mo == MemoryOrder.relaxed) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw add i32 addrspace(3)* %ptr, i32 %1 monotonic
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	} else static if (mo == MemoryOrder.acquire) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw add i32 addrspace(3)* %ptr, i32 %1 acquire
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	} else static if (mo == MemoryOrder.release) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw add i32 addrspace(3)* %ptr, i32 %1 release
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	} else static if (mo == MemoryOrder.acq_rel) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw add i32 addrspace(3)* %ptr, i32 %1 acq_rel
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	} else static if (mo == MemoryOrder.seq_cst) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw add i32 addrspace(3)* %ptr, i32 %1 seq_cst
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	}
	else
		static assert(0, "atomicAddShared doesn't support memoryOrder " ~mo.stringof);
}

int atomicAdd(MemoryOrder mo = MemoryOrder.seq_cst)(GlobalPointer!int dst, int val)
{
	static if (mo == MemoryOrder.relaxed) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw add i32 addrspace(1)* %ptr, i32 %1 monotonic
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	} else static if (mo == MemoryOrder.acquire) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw add i32 addrspace(1)* %ptr, i32 %1 acquire
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	} else static if (mo == MemoryOrder.release) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw add i32 addrspace(1)* %ptr, i32 %1 release
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	} else static if (mo == MemoryOrder.acq_rel) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw add i32 addrspace(1)* %ptr, i32 %1 acq_rel
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	} else static if (mo == MemoryOrder.seq_cst) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw add i32 addrspace(1)* %ptr, i32 %1 seq_cst
			ret i32 %old`, int)(cast(long)dst, cast(int)val);
	}
	else 
		static assert(0, "atomicAdd doesn't support memoryOrder " ~mo.stringof);
}
/*
 * @brief Atomically exchanges the value at the address with a new value.
 * @param dst The global memory address (passed as i64).
 * @param newVal The integer value to store (i32).
 * @return The old value that was stored at the address (i32).
 */
int atomicExchange(MemoryOrder mo = MemoryOrder.seq_cst)(GlobalPointer!int dst, int newVal)
{
	// The GlobalPointer!int struct is cast to a raw long (i64) to bypass complex LDC type parsing.
	static if (mo == MemoryOrder.relaxed) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw xchg i32 addrspace(1)* %ptr, i32 %1 monotonic
			ret i32 %old`, int)(cast(long)dst, newVal);
	} else static if (mo == MemoryOrder.acquire) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw xchg i32 addrspace(1)* %ptr, i32 %1 acquire
			ret i32 %old`, int)(cast(long)dst, newVal);
	} else static if (mo == MemoryOrder.release) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw xchg i32 addrspace(1)* %ptr, i32 %1 release
			ret i32 %old`, int)(cast(long)dst, newVal);
	} else static if (mo == MemoryOrder.acq_rel) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw xchg i32 addrspace(1)* %ptr, i32 %1 acq_rel
			ret i32 %old`, int)(cast(long)dst, newVal);
	} else static if (mo == MemoryOrder.seq_cst) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(1)*
			%old = atomicrmw xchg i32 addrspace(1)* %ptr, i32 %1 seq_cst
			ret i32 %old`, int)(cast(long)dst, newVal);
	}
	else
		static assert(0, "atomicExchange doesn't support memoryOrder " ~mo.stringof);
}
int atomicExchangeShared(MemoryOrder mo = MemoryOrder.seq_cst)(SharedPointer!int dst, int newVal)
{
    // The SharedPointer!int struct is cast to a raw long (i64) to bypass complex LDC type parsing.
	static if (mo == MemoryOrder.relaxed) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw xchg i32 addrspace(3)* %ptr, i32 %1 monotonic
			ret i32 %old`, int)(cast(long)dst, newVal);
	} else static if (mo == MemoryOrder.acquire) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw xchg i32 addrspace(3)* %ptr, i32 %1 acquire
			ret i32 %old`, int)(cast(long)dst, newVal);
	} else static if (mo == MemoryOrder.release) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw xchg i32 addrspace(3)* %ptr, i32 %1 release
			ret i32 %old`, int)(cast(long)dst, newVal);
	} else static if (mo == MemoryOrder.acq_rel) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw xchg i32 addrspace(3)* %ptr, i32 %1 acq_rel
			ret i32 %old`, int)(cast(long)dst, newVal);
	} else static if (mo == MemoryOrder.seq_cst) {
		return inlineIR!(`
			%ptr = inttoptr i64 %0 to i32 addrspace(3)*
			%old = atomicrmw xchg i32 addrspace(3)* %ptr, i32 %1 seq_cst
			ret i32 %old`, int)(cast(long)dst, newVal);
	}
	else
		static assert(0, "atomicExchangeShared doesn't support memoryOrder " ~mo.stringof);
}
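/*
 Example (a host-side sketch, not part of this module): the atomic helpers
 above all return the value that was stored before the operation. The snippet
 below shows the same "return the old value" contract using druntime's
 core.atomic on the CPU. fetchAddDemo and exchangeDemo are hypothetical names
 used only for illustration; the device functions themselves cannot run here.

```d
import core.atomic : atomicExchange, atomicFetchAdd, atomicLoad;

// Returns [value before the add, value left in the cell].
int[2] fetchAddDemo(int start, int delta)
{
    shared int cell = start;
    int old = atomicFetchAdd(cell, delta);
    return [old, atomicLoad(cell)];
}

// Returns [value before the exchange, value left in the cell].
int[2] exchangeDemo(int start, int newVal)
{
    shared int cell = start;
    int old = atomicExchange(&cell, newVal);
    return [old, atomicLoad(cell)];
}
```

 Both host calls use the default (sequentially consistent) ordering, which
 corresponds to MemoryOrder.seq_cst in the device code above.
 */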


================================================
FILE: source/dcompute/std/cuda/index.d
================================================
@compute(CompileFor.deviceOnly) module dcompute.std.cuda.index;

import ldc.dcompute;
pure: nothrow: @nogc:
//tid = threadId
pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.tid.x")
uint tid_x();

pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.tid.y")
uint tid_y();

pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.tid.z")
uint tid_z();

//ntid = blockDim
pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.ntid.x")
uint ntid_x();

pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.ntid.y")
uint ntid_y();

pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.ntid.z")
uint ntid_z();

//ctaid = blockIdx
pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.ctaid.x")
uint ctaid_x();

pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.ctaid.y")
uint ctaid_y();

pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.ctaid.z")
uint ctaid_z();

//nctaid = gridDim
pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.nctaid.x")
uint nctaid_x();

pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.nctaid.y")
uint nctaid_y();

pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.nctaid.z")
uint nctaid_z();

//warpsize
pragma(LDC_intrinsic, "llvm.nvvm.read.ptx.sreg.warpsize")
uint warpsize();




================================================
FILE: source/dcompute/std/cuda/sync.d
================================================
@compute(CompileFor.deviceOnly) module dcompute.std.cuda.sync;

import ldc.dcompute;
import ldc.intrinsics;

pragma(LDC_intrinsic, "llvm.nvvm.barrier0")
void barrier0();

static if (LLVM_atleast!21) { // >= LDC 1.42.0(LLVM 21)
    pragma(LDC_intrinsic, "llvm.nvvm.barrier.cta.sync.aligned.all")
    void barrier_n(int);
}

pragma(LDC_intrinsic, "llvm.nvvm.barrier0.and")
int barrier0_and(int);

pragma(LDC_intrinsic, "llvm.nvvm.barrier0.or")
int barrier0_or(int);

pragma(LDC_intrinsic, "llvm.nvvm.barrier0.popc")
int barrier0_popc(int);

//block memory barrier
pragma(LDC_intrinsic, "llvm.nvvm.membar.cta")
void membar_cta();

//device global
pragma(LDC_intrinsic, "llvm.nvvm.membar.gl")
void membar_gl();

//system global
pragma(LDC_intrinsic, "llvm.nvvm.membar.sys")
void membar_sys();



================================================
FILE: source/dcompute/std/floating.d
================================================
@compute(CompileFor.hostAndDevice) module dcompute.std.floating;

import ldc.dcompute;

/*
 *Intrinsic
 * isfinite
 * isinfinite
 * isnan
 * isnormal
 * signed
 * abs
 * ceil
 * copysign
 * fdim
 * floor
 * fma
 * fract
 * frexp
 * ilogb
 * ldexp
 * min
 * max
 * pow
 * powr
 * powi
 * trunc
 * sqrt
 * rsqrt
 *Standard Transcendentals:
 * acos
 * acosh
 * asin
 * asinh
 * atan
 * atan2
 * atanh
 * cos
 * cosh
 * cospi
 * exp
 * exp2
 * exp10
 * log
 * log2
 * log10
 * sincos
 * sin
 * sinh
 * sinpi
 * tan
 * tanh
 * tanpi
 */


================================================
FILE: source/dcompute/std/index.d
================================================
@compute(CompileFor.hostAndDevice) module dcompute.std.index;

import ldc.dcompute;

private import ocl  = dcompute.std.opencl.index;
private import cuda = dcompute.std.cuda.index;

/*
 Index Terminology
 
 DCompute               CUDA                        OpenCL
 GlobalDimension.xyz    gridDim*blockDim            get_global_size()
 GlobalIndex.xyz        blockDim*blockIdx+threadIdx get_global_id()
 
 
 GroupDimension.xyz     gridDim                     get_num_groups()
 GroupIndex.xyz         blockIdx                    get_group_id()
 
 SharedDimension.xyz    blockDim                    get_local_size()
 SharedIndex.xyz        threadIdx                   get_local_id()
 
 GlobalIndex.linear     A nasty calculation         get_global_linear_id()
 SharedIndex.linear     Ditto                       get_local_linear_id()
 
 Notes:
    *Index.{x,y,z} are bounded by *Dimension.{x,y,z}
 
    Use SharedIndex values to index Shared Memory and GlobalIndex values to index Global Memory.
 
    A Group is the ratio of Global to Shared. GroupDimension is NOT the size of a single
    group (that's SharedDimension); rather, it is the number of groups along e.g.
    the x dimension. Similarly, GroupIndex is the position of the current group along a
    given dimension, counted in units of SharedDimension.
 
    By default *Index.linear is the linearisation of 3D memory. Use *Index.linear!N where
    N is 1, 2 or 3 to use a linearisation of ND memory (for e.g. efficiency/documentation).
 */
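/*
 Example (a host-side sketch, assuming row-major layout with x varying
 fastest, as in linearImpl below): the linearisation used by *Index.linear!N
 can be checked on the CPU with plain integers. linearIndex is a hypothetical
 helper, not part of this module; dimX and dimY stand in for *Dimension.x and
 *Dimension.y.

```d
// Linearise an (x, y, z) index inside a grid with extents dimX and dimY.
size_t linearIndex(int dim = 3)(size_t x, size_t y, size_t z,
                                size_t dimX, size_t dimY)
    if (dim >= 1 && dim <= 3)
{
    static if (dim == 3)
        return (z * dimY * dimX) + (y * dimX) + x;
    else static if (dim == 2)
        return (y * dimX) + x;   // z is ignored
    else
        return x;                // y and z are ignored
}
```

 With dimX = 4 and dimY = 5, the index (1, 2, 3) maps to 3*5*4 + 2*4 + 1 = 69
 in 3D and to 2*4 + 1 = 9 in 2D.
 */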
pure: nothrow: @nogc:

struct GlobalDimension
{
    pragma(inline,true);
    @property static size_t x()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_global_size(0);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ntid_x()*cuda.nctaid_x();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t y()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_global_size(1);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ntid_y()*cuda.nctaid_y();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t z()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_global_size(2);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ntid_z()*cuda.nctaid_z();
        else
            assert(0);
    }
}

struct GlobalIndex
{
    pragma(inline,true);
    @property static size_t x()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_global_id(0);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ctaid_x()*cuda.ntid_x() + cuda.tid_x();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t y()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_global_id(1);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ctaid_y()*cuda.ntid_y() + cuda.tid_y();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t z()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_global_id(2);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ctaid_z()*cuda.ntid_z() + cuda.tid_z();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t linearImpl(int dim = 3)()
    if(dim >= 1 && dim <= 3)
    {
        static if (dim == 3)
            return  (z * GlobalDimension.y * GlobalDimension.x) +
                    (y * GlobalDimension.x) + x;
        else static if (dim == 2)
            return (y * GlobalDimension.x) + x;
        else
            return x;
    }
    pragma(inline,true);
    @property static size_t linear(int dim = 3)() if(dim >= 1 && dim <= 3)
    {
        //Forward to the intrinsic to help memoisation for the consumer.
        if(__dcompute_reflect(ReflectTarget.OpenCL,200))
            return ocl.get_global_linear_id();
        else if(__dcompute_reflect(ReflectTarget.OpenCL,210))
            return ocl.get_global_linear_id();
        else if(__dcompute_reflect(ReflectTarget.OpenCL,220))
            return ocl.get_global_linear_id();
        else
            return linearImpl!dim;
    }
}

struct GroupDimension
{
    pragma(inline,true);
    @property static size_t x()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_num_groups(0);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.nctaid_x();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t y()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_num_groups(1);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.nctaid_y();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t z()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_num_groups(2);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.nctaid_z();
        else
            assert(0);
    }
}

struct GroupIndex
{
    pragma(inline,true);
    @property static size_t x()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_group_id(0);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ctaid_x();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t y()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_group_id(1);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ctaid_y();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t z()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_group_id(2);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ctaid_z();
        else
            assert(0);
    }
}

struct SharedDimension
{
    pragma(inline,true);
    @property static size_t x()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_local_size(0);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ntid_x();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t y()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_local_size(1);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ntid_y();
        else
            assert(0);

    }
    pragma(inline,true);
    @property static size_t z()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_local_size(2);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.ntid_z();
        else
            assert(0);
    }
}

struct SharedIndex
{
    pragma(inline,true);
    @property static size_t x()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_local_id(0);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.tid_x();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t y()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_local_id(1);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.tid_y();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t z()()
    {
        if(__dcompute_reflect(ReflectTarget.OpenCL,0))
            return ocl.get_local_id(2);
        else if(__dcompute_reflect(ReflectTarget.CUDA,0))
            return cuda.tid_z();
        else
            assert(0);
    }
    pragma(inline,true);
    @property static size_t linearImpl(int dim = 3)()
    if(dim >= 1 && dim <= 3)
    {
        static if (dim == 3)
            return  (z * SharedDimension.y * SharedDimension.x) +
                    (y * SharedDimension.x) + x;
        else static if (dim == 2)
                return (y * SharedDimension.x) + x;
        else
            return x;

    }
    pragma(inline,true);
    @property static size_t linear(int dim = 3)() if(dim >= 1 && dim <= 3)
    {
        //Forward to the intrinsic to help memoisation for the consumer.
        if(__dcompute_reflect(ReflectTarget.OpenCL,200))
            return ocl.get_local_linear_id();
        else if(__dcompute_reflect(ReflectTarget.OpenCL,210))
            return ocl.get_local_linear_id();
        else if(__dcompute_reflect(ReflectTarget.OpenCL,220))
            return ocl.get_local_linear_id();
        else
            return linearImpl!dim;
        
    }
}

private import std.traits;
struct AutoIndexed(T) //if (isInstanceOf(T,Pointer))
{
    T p = void;
    enum  n = TemplateArgsOf!(T)[0];
    alias U = TemplateArgsOf!(T)[1];
    static assert(n == AddrSpace.Global || n == AddrSpace.Shared);
    
    @property U index()
    {
        static if (n == AddrSpace.Global)
            return p[GlobalIndex.linear];
        else static if (n == AddrSpace.Shared)
            return p[SharedIndex.linear];

    }
    
    @property void index(U t)
    {
        static if (n == AddrSpace.Global)
            p[GlobalIndex.linear] = t;
        else static if (n == AddrSpace.Shared)
            p[SharedIndex.linear] = t;
    }
    @disable this();
    alias index this;
}


================================================
FILE: source/dcompute/std/integer.d
================================================
@compute(CompileFor.hostAndDevice) module dcompute.std.integer;

import ldc.dcompute;

/*
 brev - bit reverse
 sad  - sum of absolute differences
 abs
 min
 max
 add_sat
 sub_sat
 mul_hi
 mul_low
 mad
 mad_hi
 mad_lo
 mad_hi_sat
 mul24_hi
 mul24_lo
 mad24_hi
 mad24_lo
 mad24_hi_sat
sm2.0 or higher
 popc   - count the number of set bits
 clz    - count the number of leading zeros
 bfind  - find most significant non-sign bit
 bfe    - bit field extract
 bfi    - bit field insert
 overflow arithmetic
 ctz - count trailing zeros
 rotate
 */
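/*
 Example (a host-side sketch of the intended semantics, assuming the planned
 functions follow the OpenCL C definitions of add_sat and mul_hi for 32-bit
 signed integers). addSat and mulHi are hypothetical names, not an API of
 this module.

```d
// Saturating signed add: clamps to int.min / int.max instead of wrapping.
int addSat(int a, int b)
{
    long wide = cast(long) a + b;
    if (wide > int.max) return int.max;
    if (wide < int.min) return int.min;
    return cast(int) wide;
}

// High 32 bits of the full 64-bit signed product.
int mulHi(int a, int b)
{
    return cast(int) ((cast(long) a * b) >> 32);
}
```

 A device implementation would normally map these to target intrinsics; the
 widening arithmetic here only pins down the expected results.
 */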


================================================
FILE: source/dcompute/std/memory.d
================================================
@compute(CompileFor.hostAndDevice) module dcompute.std.memory;

import ldc.dcompute;

/*
 *Pointer conversions:
 * *Pointer!T genericPtrTo*(GenericPointer!T ptr)
 * GenericPointer!T *toGenericPtr(*Pointer!T ptr)
 *
 *Shared memory:
 * SharedPointer!T sharedStaticReserve!(T[N])
 * SharedPointer!void sharedDynamicBase();
 * auto sharedIndices!(Ts...) if(isSharedIndex!(Ts...)) // (T, alias length) pairs
       see http://stackoverflow.com/questions/15435559/use-dynamic-shared-memory-allocation-for-two-different-vectors
       for what this emulates and why. Memory is aligned to A = reduce!max(T.alignof).
       Returns a tuple of {SharedPointer!(align(A) T), length} "arrays"
 */
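/*
 Example (a host-side sketch of the layout arithmetic only, assuming a single
 power-of-two alignment A shared by every array, as the note on sharedIndices
 above describes). alignUp and carveOffsets are hypothetical helpers; they
 compute where each array would start inside one dynamic shared buffer.

```d
// Round offset up to the next multiple of alignment (a power of two).
size_t alignUp(size_t offset, size_t alignment)
{
    return (offset + alignment - 1) & ~(alignment - 1);
}

// Byte offsets of consecutive arrays carved out of a single buffer, where
// array i holds lengths[i] elements of sizes[i] bytes each.
size_t[] carveOffsets(const size_t[] sizes, const size_t[] lengths,
                      size_t alignment)
{
    assert(sizes.length == lengths.length);
    size_t[] offsets;
    size_t cursor = 0;
    foreach (i; 0 .. sizes.length)
    {
        cursor = alignUp(cursor, alignment);
        offsets ~= cursor;
        cursor += sizes[i] * lengths[i];
    }
    return offsets;
}
```

 For three 4-byte elements followed by two 8-byte elements with A = 8, the
 first array starts at byte 0 and occupies 12 bytes, so the second array is
 pushed to byte 16.
 */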


================================================
FILE: source/dcompute/std/opencl/image.d
================================================
@compute(CompileFor.deviceOnly) module dcompute.std.opencl.image;

import ldc.dcompute;
//separate module for opaque image type because the backend requires it
public import ldc.opencl;

template Image(int dim)
{
    static if (dim == 1)
        alias Image = GlobalPointer!image1d_rw_t;
    else static if (dim == 2)
        alias Image = GlobalPointer!image2d_rw_t;
    else static if (dim == 3)
        alias Image = GlobalPointer!image3d_rw_t;
}
template ReadOnlyImage(int dim)
{
    static if (dim == 1)
        alias ReadOnlyImage = GlobalPointer!image1d_ro_t;
    else static if (dim == 2)
        alias ReadOnlyImage = GlobalPointer!image2d_ro_t;
    else static if (dim == 3)
        alias ReadOnlyImage = GlobalPointer!image3d_ro_t;
}
template WriteOnlyImage(int dim)
{
    static if (dim == 1)
        alias WriteOnlyImage = GlobalPointer!image1d_wo_t;
    else static if (dim == 2)
        alias WriteOnlyImage = GlobalPointer!image2d_wo_t;
    else static if (dim == 3)
        alias WriteOnlyImage = GlobalPointer!image3d_wo_t;
}
/* Sampler
    A type used to control how elements of an image object are read by read_image
    Sampler arguments to read_image must be literals.
Coordinate normalisation
    CLK_NORMALIZED_COORDS_TRUE,
    CLK_NORMALIZED_COORDS_FALSE
Addressing mode
    CLK_ADDRESS_MIRRORED_REPEAT requires CLK_NORMALIZED_COORDS_TRUE
        Flip the image coordinate at every integer junction.
        If normalized coordinates are not used, this addressing mode may
        generate image coordinates that are undefined.
        Example: cba|abcd|dcb.
    CLK_ADDRESS_REPEAT requires CLK_NORMALIZED_COORDS_TRUE
        out-of-range image coordinates are wrapped to the valid range.
        If normalized coordinates are not used, this addressing mode may
        generate image coordinates that are undefined.
        Example: bcd|abcd|abc.
    CLK_ADDRESS_CLAMP_TO_EDGE
        out-of-range image coordinates are clamped to the extent.
        Example: aaa|abcd|ddd.
    CLK_ADDRESS_CLAMP
        out-of-range image coordinates will return a border color.
        This is similar to the GL_ADDRESS_CLAMP_TO_BORDER addressing mode.
        Example: 000|abcd|000.
    CLK_ADDRESS_NONE -
        for this addressing mode the programmer guarantees that the image
        coordinates used to sample elements of the image refer to a location
        inside the image; otherwise the results are undefined.
    For 1D and 2D image arrays, the addressing mode applies only to the x and
    (x, y) coordinates. The addressing mode for the coordinate which specifies
    the array index is always CLK_ADDRESS_CLAMP_TO_EDGE
Filter mode
    CLK_FILTER_NEAREST
    CLK_FILTER_LINEAR
 */
enum SamplerAddressMode : int
{
    none = 0,
    mirroredRepeat = 0x10,
    repeat  = 0x20,
    clampToEdge = 0x30,
    clamp = 0x40,
}
enum SamplerFilterMode : int
{
    nearest = 0,
    linear = 0x100
}
int samplerInit(bool normalisedCoords, SamplerAddressMode am, SamplerFilterMode fm)()
{
    return cast(int)(normalisedCoords | am | fm);
}
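/*
 Example (a host-side sketch of the bit packing, assuming the normalised
 coordinates flag occupies bit 0 and using the enum values declared above).
 AddressMode, FilterMode and samplerBits are hypothetical stand-ins so the
 snippet compiles on its own.

```d
enum AddressMode : int
{
    none = 0, mirroredRepeat = 0x10, repeat = 0x20, clampToEdge = 0x30, clamp = 0x40
}
enum FilterMode : int { nearest = 0, linear = 0x100 }

// Pack the three sampler properties into one integer initialiser.
int samplerBits(bool normalisedCoords, AddressMode am, FilterMode fm)
{
    return (normalisedCoords ? 1 : 0) | am | fm;
}
```

 Normalised coordinates with clamp-to-edge addressing and linear filtering
 pack to 0x1 | 0x30 | 0x100 = 0x131.
 */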

alias Sampler = SharedPointer!sampler_t;

pragma(mangle,"__translate_sampler_initializer")
    Sampler makeSampler(int);

// TODO: Image 1d array, Image 1d buffer, Image 2d array depth, Image 2d array
// Refer to https://github.com/KhronosGroup/SPIR-Tools/wiki/SPIR-2.0-built-in-functions#image-read-and-write-functions
// for the read/write mangles
// Alternately use https://godbolt.org with `-target spir -O0 -emit-llvm` and check the IR generated by clang

template read(T) if (is(T == float))
{
    // return type
    alias T4 = __vector(T[4]);

    pragma(mangle,"_Z11read_imagef11ocl_image1d11ocl_samplerf")
        T4 read(GlobalPointer!image1d_rw_t, Sampler, float);
    pragma(mangle,"_Z11read_imagef11ocl_image1d11ocl_sampleri")
        T4 read(GlobalPointer!image1d_rw_t, Sampler, int);

    pragma(mangle,"_Z11read_imagef14ocl_image1d_ro11ocl_samplerf")
        T4 read(GlobalPointer!image1d_ro_t, Sampler, float);
    pragma(mangle,"_Z11read_imagef14ocl_image1d_ro11ocl_sampleri")
        T4 read(GlobalPointer!image1d_ro_t, Sampler, int);

    pragma(mangle,"_Z11read_imagef11ocl_image2d11ocl_samplerDv2_f")
        T4 read(GlobalPointer!image2d_rw_t, Sampler, __vector(float[2]));
    pragma(mangle,"_Z11read_imagef11ocl_image2d11ocl_samplerDv2_i")
        T4 read(GlobalPointer!image2d_rw_t, Sampler, __vector(int[2]));

    pragma(mangle,"_Z11read_imagef14ocl_image2d_ro11ocl_samplerDv2_f")
        T4 read(GlobalPointer!image2d_ro_t, Sampler, __vector(float[2]));
    pragma(mangle,"_Z11read_imagef14ocl_image2d_ro11ocl_samplerDv2_i")
        T4 read(GlobalPointer!image2d_ro_t, Sampler, __vector(int[2]));

    pragma(mangle,"_Z11read_imagef11ocl_image3d11ocl_samplerDv4_f")
        T4 read(GlobalPointer!image3d_rw_t, Sampler, __vector(float[4]));
    pragma(mangle,"_Z11read_imagef11ocl_image3d11ocl_samplerDv4_i")
        T4 read(GlobalPointer!image3d_rw_t, Sampler, __vector(int[4]));

    pragma(mangle,"_Z11read_imagef14ocl_image3d_ro11ocl_samplerDv4_f")
        T4 read(GlobalPointer!image3d_ro_t, Sampler, __vector(float[4]));
    pragma(mangle,"_Z11read_imagef14ocl_image3d_ro11ocl_samplerDv4_i")
        T4 read(GlobalPointer!image3d_ro_t, Sampler, __vector(int[4]));
}

template read(T) if (is(T == int))
{
    // return type
    alias T4 = __vector(T[4]);
    pragma(mangle,"_Z11read_imagei11ocl_image1d11ocl_samplerf")
        T4 read(GlobalPointer!image1d_rw_t, Sampler, float);
    pragma(mangle,"_Z11read_imagei11ocl_image1d11ocl_sampleri")
        T4 read(GlobalPointer!image1d_rw_t, Sampler, int);
    
    pragma(mangle,"_Z11read_imagei14ocl_image1d_ro11ocl_samplerf")
        T4 read(GlobalPointer!image1d_ro_t, Sampler, float);
    pragma(mangle,"_Z11read_imagei14ocl_image1d_ro11ocl_sampleri")
        T4 read(GlobalPointer!image1d_ro_t, Sampler, int);

    pragma(mangle,"_Z11read_imagei11ocl_image2d11ocl_samplerDv2_f")
        T4 read(GlobalPointer!image2d_rw_t, Sampler, __vector(float[2]));
    pragma(mangle,"_Z11read_imagei11ocl_image2d11ocl_samplerDv2_i")
        T4 read(GlobalPointer!image2d_rw_t, Sampler, __vector(int[2]));

    pragma(mangle,"_Z11read_imagei14ocl_image2d_ro11ocl_samplerDv2_f")
        T4 read(GlobalPointer!image2d_ro_t, Sampler, __vector(float[2]));
    pragma(mangle,"_Z11read_imagei14ocl_image2d_ro11ocl_samplerDv2_i")
        T4 read(GlobalPointer!image2d_ro_t, Sampler, __vector(int[2]));

    pragma(mangle,"_Z11read_imagei11ocl_image3d11ocl_samplerDv4_f")
        T4 read(GlobalPointer!image3d_rw_t, Sampler, __vector(float[4]));
    pragma(mangle,"_Z11read_imagei11ocl_image3d11ocl_samplerDv4_i")
        T4 read(GlobalPointer!image3d_rw_t, Sampler, __vector(int[4]));

    pragma(mangle,"_Z11read_imagei14ocl_image3d_ro11ocl_samplerDv4_f")
        T4 read(GlobalPointer!image3d_ro_t, Sampler, __vector(float[4]));
    pragma(mangle,"_Z11read_imagei14ocl_image3d_ro11ocl_samplerDv4_i")
        T4 read(GlobalPointer!image3d_ro_t, Sampler, __vector(int[4]));
}

template read(T) if (is(T == uint))
{
    // return type
    alias T4 = __vector(T[4]);
    pragma(mangle,"_Z12read_imageui11ocl_image1d11ocl_samplerf")
        T4 read(GlobalPointer!image1d_rw_t, Sampler, float);
    pragma(mangle,"_Z12read_imageui11ocl_image1d11ocl_sampleri")
        T4 read(GlobalPointer!image1d_rw_t, Sampler, int);
    pragma(mangle,"_Z12read_imageui14ocl_image1d_ro11ocl_samplerf")
        T4 read(GlobalPointer!image1d_ro_t, Sampler, float);
    pragma(mangle,"_Z12read_imageui14ocl_image1d_ro11ocl_sampleri")
        T4 read(GlobalPointer!image1d_ro_t, Sampler, int);

    pragma(mangle,"_Z12read_imageui11ocl_image2d11ocl_samplerDv2_f")
        T4 read(GlobalPointer!image2d_rw_t, Sampler, __vector(float[2]));
    pragma(mangle,"_Z12read_imageui11ocl_image2d11ocl_samplerDv2_i")
        T4 read(GlobalPointer!image2d_rw_t, Sampler, __vector(int[2]));
    pragma(mangle,"_Z12read_imageui14ocl_image2d_ro11ocl_samplerDv2_f")
        T4 read(GlobalPointer!image2d_ro_t, Sampler, __vector(float[2]));
    pragma(mangle,"_Z12read_imageui14ocl_image2d_ro11ocl_samplerDv2_i")
        T4 read(GlobalPointer!image2d_ro_t, Sampler, __vector(int[2]));

    pragma(mangle,"_Z12read_imageui11ocl_image3d11ocl_samplerDv4_f")
        T4 read(GlobalPointer!image3d_rw_t, Sampler, __vector(float[4]));
    pragma(mangle,"_Z12read_imageui11ocl_image3d11ocl_samplerDv4_i")
        T4 read(GlobalPointer!image3d_rw_t, Sampler, __vector(int[4]));
    pragma(mangle,"_Z12read_imageui14ocl_image3d_ro11ocl_samplerDv4_f")
        T4 read(GlobalPointer!image3d_ro_t, Sampler, __vector(float[4]));
    pragma(mangle,"_Z12read_imageui14ocl_image3d_ro11ocl_samplerDv4_i")
        T4 read(GlobalPointer!image3d_ro_t, Sampler, __vector(int[4]));
}

template write(I) if (is(I==GlobalPointer!image1d_rw_t))
{
    pragma(mangle,"_Z12write_imagef11ocl_image1diDv4_f")
        void write(I,int,__vector(float[4]));
    pragma(mangle,"_Z12write_imagei11ocl_image1diDv4_i")
        void write(I,int,__vector(int[4]));
    pragma(mangle,"_Z13write_imageui11ocl_image1diDv4_j")
        void write(I,int,__vector(uint[4]));
}
template write(I) if (is(I==GlobalPointer!image1d_wo_t))
{
    pragma(mangle,"_Z12write_imagef14ocl_image1d_woiDv4_f")
        void write(I,int,__vector(float[4]));
    pragma(mangle,"_Z12write_imagei14ocl_image1d_woiDv4_i")
        void write(I,int,__vector(int[4]));
    pragma(mangle,"_Z13write_imageui14ocl_image1d_woiDv4_j")
        void write(I,int,__vector(uint[4]));
}

template write(I) if (is(I==GlobalPointer!image2d_rw_t))
{
    pragma(mangle,"_Z12write_imagef11ocl_image2dDv2_iDv4_f")
        void write(I, __vector(int[2]), __vector(float[4]));
    pragma(mangle,"_Z12write_imagei11ocl_image2dDv2_iDv4_i")
        void write(I, __vector(int[2]), __vector(int[4]));
    pragma(mangle,"_Z13write_imageui11ocl_image2dDv2_iDv4_j")
        void write(I, __vector(int[2]), __vector(uint[4]));
}
template write(I) if (is(I==GlobalPointer!image2d_wo_t))
{
    pragma(mangle,"_Z12write_imagef14ocl_image2d_woDv2_iDv4_f")
        void write(I, __vector(int[2]), __vector(float[4]));
    pragma(mangle,"_Z12write_imagei14ocl_image2d_woDv2_iDv4_i")
        void write(I, __vector(int[2]), __vector(int[4]));
    pragma(mangle,"_Z13write_imageui14ocl_image2d_woDv2_iDv4_j")
        void write(I, __vector(int[2]), __vector(uint[4]));
}

template write(I) if (is(I==GlobalPointer!image3d_rw_t))
{
    pragma(mangle,"_Z12write_imagef11ocl_image3dDv4_iDv4_f")
        void write(I,__vector(int[4]),__vector(float[4]));
    pragma(mangle,"_Z12write_imagei11ocl_image3dDv4_iDv4_i")
        void write(I,__vector(int[4]),__vector(int[4]));
    pragma(mangle,"_Z13write_imageui11ocl_image3dDv4_iDv4_j")
        void write(I,__vector(int[4]),__vector(uint[4]));
}
template write(I) if (is(I==GlobalPointer!image3d_wo_t))
{
    pragma(mangle,"_Z12write_imagef14ocl_image3d_woDv4_iDv4_f")
        void write(I, __vector(int[4]), __vector(float[4]));
    pragma(mangle,"_Z12write_imagei14ocl_image3d_woDv4_iDv4_i")
        void write(I, __vector(int[4]), __vector(int[4]));
    pragma(mangle,"_Z13write_imageui14ocl_image3d_woDv4_iDv4_j")
        void write(I, __vector(int[4]), __vector(uint[4]));
}


================================================
FILE: source/dcompute/std/opencl/index.d
================================================
@compute(CompileFor.deviceOnly) module dcompute.std.opencl.index;

import ldc.dcompute;

pure:
nothrow:
@nogc:

// These really ought to be intrinsics, but for some reason they aren't.

/**
 * Returns the number of dimensions in use. This is the
 * value given to the work_dim argument specified in
 * clEnqueueNDRangeKernel.
 * For clEnqueueTask, this returns 1.
 */
pragma(mangle,"_Z12get_work_dimv")
uint get_work_dim();

/**
 * Returns the number of global work-items specified for
 * dimension identified by dimindx. This value is given by
 * the global_work_size argument to
 * clEnqueueNDRangeKernel. Valid values of dimindx
 * are 0 to get_work_dim() - 1. For other values of
 * dimindx, get_global_size() returns 1.
 * For clEnqueueTask, this always returns 1.
 */
pragma(mangle,"_Z15get_global_sizej")
size_t get_global_size(uint dimindx);

/**
 * Returns the unique global work-item ID value for
 * dimension identified by dimindx. The global work-item
 * ID specifies the work-item ID based on the number of
 * global work-items specified to execute the kernel. Valid
 * values of dimindx are 0 to get_work_dim() - 1. For
 * other values of dimindx, get_global_id() returns 0.
 * For clEnqueueTask, this returns 0.
 */
pragma(mangle,"_Z13get_global_idj")
size_t get_global_id(uint dimindx);

/**
 * Returns the number of local work-items specified in
 * dimension identified by dimindx. This value is given by
 * the local_work_size argument to
 * clEnqueueNDRangeKernel if local_work_size is not
 * NULL; otherwise the OpenCL implementation chooses
 * an appropriate local_work_size value which is returned
 * by this function. Valid values of dimindx are 0 to
 * get_work_dim() - 1. For other values of dimindx,
 * get_local_size() returns 1.
 * For clEnqueueTask, this always returns 1.
 */
pragma(mangle,"_Z14get_local_sizej")
size_t get_local_size(uint dimindx);

/**
 * Returns the unique local work-item ID i.e. a work-item
 * within a specific work-group for dimension identified by
 * dimindx. Valid values of dimindx are 0 to
 * get_work_dim() - 1. For other values of dimindx,
 * get_local_id() returns 0.
 * For clEnqueueTask, this returns 0.
 */
pragma(mangle,"_Z12get_local_idj")
size_t get_local_id(uint dimindx);

/**
 * Returns the number of work-groups that will execute a
 * kernel for dimension identified by dimindx.
 * Valid values of dimindx are 0 to get_work_dim() - 1.
 * For other values of dimindx, get_num_groups () returns
 * 1.
 * For clEnqueueTask, this always returns 1.
 */
pragma(mangle,"_Z14get_num_groupsj")
size_t get_num_groups(uint dimindx);

/**
 * get_group_id returns the work-group ID which is a
 * number from 0 .. get_num_groups(dimindx) - 1.
 * Valid values of dimindx are 0 to get_work_dim() - 1.
 * For other values, get_group_id() returns 0.
 * For clEnqueueTask, this returns 0.
 */
pragma(mangle,"_Z12get_group_idj")
size_t get_group_id(uint dimindx);

/**
 * get_global_offset returns the offset values specified in
 * global_work_offset argument to
 * clEnqueueNDRangeKernel.
 * Valid values of dimindx are 0 to get_work_dim() - 1.
 * For other values, get_global_offset() returns 0.
 * For clEnqueueTask, this returns 0.
 */
pragma(mangle,"_Z17get_global_offsetj")
size_t get_global_offset(uint dimindx);

//pragma(mangle,"_Z15get_global_sizej")
//size_t get_enqueued_local_size(uint);
pragma(mangle,"_Z20get_global_linear_idv")
size_t get_global_linear_id();
pragma(mangle,"_Z19get_local_linear_idv")
size_t get_local_linear_id();
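
/*
 Example (a host-side sketch, assuming uniform work-group sizes): along one
 dimension the work-item functions above are related by
 global id = global offset + group id * local size + local id.
 globalId is a hypothetical helper that models this with plain integers; it
 does not call the device functions.

```d
// One-dimensional model of the NDRange indexing identity.
size_t globalId(size_t groupId, size_t localSize, size_t localId,
                size_t globalOffset = 0)
{
    return globalOffset + groupId * localSize + localId;
}
```

 For work-groups of 64 items, local item 5 of group 2 has global id
 2*64 + 5 = 133, or 143 when the range was enqueued with an offset of 10.
 */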



================================================
FILE: source/dcompute/std/opencl/math.d
================================================
/**
Provides access to the OpenCL C math functions and constants.

These functions are only callable from OpenCL kernels.
Functions taking or returning half floats and half float constants are not supported.
Standards: [6.15.2. Math Functions](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html#math-functions)$(BR)
           [The OpenCL™ C Specification](https://registry.khronos.org/OpenCL/specs/3.0-unified/html/OpenCL_C.html)
License:  [Boost License 1.0](https://boost.org/LICENSE_1_0.txt).
*/

@compute(CompileFor.deviceOnly)
module dcompute.std.opencl.math;

import ldc.dcompute;

// Constants
enum MAXFLOAT  = float.max;
enum HUGE_VALF = float.infinity;
enum INFINITY  = float.infinity;
enum NAN       = float.nan;
enum HUGE_VAL  = double.infinity;

enum FLT_DIG        = float.dig;
enum FLT_MANT_DIG   = float.mant_dig;
enum FLT_MAX_10_EXP = float.max_10_exp;
enum FLT_MAX_EXP    = float.max_exp;
enum FLT_MIN_10_EXP = float.min_10_exp;
enum FLT_MIN_EXP    = float.min_exp;
enum FLT_RADIX      = 2;
enum FLT_MAX        = float.max;
enum FLT_MIN        = float.min_normal;
enum FLT_EPSILON    = float.epsilon;

enum FP_ILOGB0   = int.min;
enum FP_ILOGBNAN = int.max;

enum M_E_F        = 2.71828182845904523536028747135266250f;
enum M_LOG2E_F    = 1.44269504088896340735992468100189214f;
enum M_LOG10E_F   = 0.434294481903251827651128918916605082f;
enum M_LN2_F      = 0.693147180559945309417232121458176568f;
enum M_LN10_F     = 2.30258509299404568401799145468436421f;
enum M_PI_F       = 3.14159265358979323846264338327950288f;
enum M_PI_2_F     = 1.57079632679489661923132169163975144f;
enum M_PI_4_F     = 0.785398163397448309615660845819875721f;
enum M_1_PI_F     = 0.318309886183790671537767526745028724f;
enum M_2_PI_F     = 0.636619772367581343075535053490057448f;
enum M_2_SQRTPI_F = 1.12837916709551257389615890312154517f;
enum M_SQRT2_F    = 1.41421356237309504880168872420969808f;
enum M_SQRT1_2_F  = 0.707106781186547524400844362104849039f;

enum DBL_DIG        = double.dig;
enum DBL_MANT_DIG   = double.mant_dig;
enum DBL_MAX_10_EXP = double.max_10_exp;
enum DBL_MAX_EXP    = double.max_exp;
enum DBL_MIN_10_EXP = double.min_10_exp;
enum DBL_MIN_EXP    = double.min_exp;
enum DBL_MAX        = double.max;
enum DBL_MIN        = double.min_normal;
enum DBL_EPSILON    = double.epsilon;

enum M_E        = 0x1.5bf0a8b145769p+1;
enum M_LOG2E    = 0x1.71547652b82fep+0;
enum M_LOG10E   = 0x1.bcb7b1526e50ep-2;
enum M_LN2      = 0x1.62e42fefa39efp-1;
enum M_LN10     = 0x1.26bb1bbb55516p+1;
enum M_PI       = 0x1.921fb54442d18p+1;
enum M_PI_2     = 0x1.921fb54442d18p+0;
enum M_PI_4     = 0x1.921fb54442d18p-1;
enum M_1_PI     = 0x1.45f306dc9c883p-2;
enum M_2_PI     = 0x1.45f306dc9c883p-1;
enum M_2_SQRTPI = 0x1.20dd750429b6dp+0;
enum M_SQRT2    = 0x1.6a09e667f3bcdp+0;
enum M_SQRT1_2  = 0x1.6a09e667f3bcdp-1;

// acos
pragma(mangle,"_Z4acosf")               float       acos(         float);
pragma(mangle,"_Z4acosDv2_f")  __vector(float[2])   acos(__vector(float[2]));
pragma(mangle,"_Z4acosDv3_f")  __vector(float[3])   acos(__vector(float[3]));
pragma(mangle,"_Z4acosDv4_f")  __vector(float[4])   acos(__vector(float[4]));
pragma(mangle,"_Z4acosDv8_f")  __vector(float[8])   acos(__vector(float[8]));
pragma(mangle,"_Z4acosDv16_f") __vector(float[16])  acos(__vector(float[16]));
pragma(mangle,"_Z4acosd")               double      acos(         double);
pragma(mangle,"_Z4acosDv2_d")  __vector(double[2])  acos(__vector(double[2]));
pragma(mangle,"_Z4acosDv3_d")  __vector(double[3])  acos(__vector(double[3]));
pragma(mangle,"_Z4acosDv4_d")  __vector(double[4])  acos(__vector(double[4]));
pragma(mangle,"_Z4acosDv8_d")  __vector(double[8])  acos(__vector(double[8]));
pragma(mangle,"_Z4acosDv16_d") __vector(double[16]) acos(__vector(double[16]));

// acosh
pragma(mangle,"_Z5acoshf")               float       acosh(         float);
pragma(mangle,"_Z5acoshDv2_f")  __vector(float[2])   acosh(__vector(float[2]));
pragma(mangle,"_Z5acoshDv3_f")  __vector(float[3])   acosh(__vector(float[3]));
pragma(mangle,"_Z5acoshDv4_f")  __vector(float[4])   acosh(__vector(float[4]));
pragma(mangle,"_Z5acoshDv8_f")  __vector(float[8])   acosh(__vector(float[8]));
pragma(mangle,"_Z5acoshDv16_f") __vector(float[16])  acosh(__vector(float[16]));
pragma(mangle,"_Z5acoshd")               double      acosh(         double);
pragma(mangle,"_Z5acoshDv2_d")  __vector(double[2])  acosh(__vector(double[2]));
pragma(mangle,"_Z5acoshDv3_d")  __vector(double[3])  acosh(__vector(double[3]));
pragma(mangle,"_Z5acoshDv4_d")  __vector(double[4])  acosh(__vector(double[4]));
pragma(mangle,"_Z5acoshDv8_d")  __vector(double[8])  acosh(__vector(double[8]));
pragma(mangle,"_Z5acoshDv16_d") __vector(double[16]) acosh(__vector(double[16]));

// acospi
pragma(mangle,"_Z6acospif")               float       acospi(         float);
pragma(mangle,"_Z6acospiDv2_f")  __vector(float[2])   acospi(__vector(float[2]));
pragma(mangle,"_Z6acospiDv3_f")  __vector(float[3])   acospi(__vector(float[3]));
pragma(mangle,"_Z6acospiDv4_f")  __vector(float[4])   acospi(__vector(float[4]));
pragma(mangle,"_Z6acospiDv8_f")  __vector(float[8])   acospi(__vector(float[8]));
pragma(mangle,"_Z6acospiDv16_f") __vector(float[16])  acospi(__vector(float[16]));
pragma(mangle,"_Z6acospid")               double      acospi(         double);
pragma(mangle,"_Z6acospiDv2_d")  __vector(double[2])  acospi(__vector(double[2]));
pragma(mangle,"_Z6acospiDv3_d")  __vector(double[3])  acospi(__vector(double[3]));
pragma(mangle,"_Z6acospiDv4_d")  __vector(double[4])  acospi(__vector(double[4]));
pragma(mangle,"_Z6acospiDv8_d")  __vector(double[8])  acospi(__vector(double[8]));
pragma(mangle,"_Z6acospiDv16_d") __vector(double[16]) acospi(__vector(double[16]));

// asin
pragma(mangle,"_Z4asinf")               float       asin(         float);
pragma(mangle,"_Z4asinDv2_f")  __vector(float[2])   asin(__vector(float[2]));
pragma(mangle,"_Z4asinDv3_f")  __vector(float[3])   asin(__vector(float[3]));
pragma(mangle,"_Z4asinDv4_f")  __vector(float[4])   asin(__vector(float[4]));
pragma(mangle,"_Z4asinDv8_f")  __vector(float[8])   asin(__vector(float[8]));
pragma(mangle,"_Z4asinDv16_f") __vector(float[16])  asin(__vector(float[16]));
pragma(mangle,"_Z4asind")               double      asin(         double);
pragma(mangle,"_Z4asinDv2_d")  __vector(double[2])  asin(__vector(double[2]));
pragma(mangle,"_Z4asinDv3_d")  __vector(double[3])  asin(__vector(double[3]));
pragma(mangle,"_Z4asinDv4_d")  __vector(double[4])  asin(__vector(double[4]));
pragma(mangle,"_Z4asinDv8_d")  __vector(double[8])  asin(__vector(double[8]));
pragma(mangle,"_Z4asinDv16_d") __vector(double[16]) asin(__vector(double[16]));

// asinh
pragma(mangle,"_Z5asinhf")               float       asinh(         float);
pragma(mangle,"_Z5asinhDv2_f")  __vector(float[2])   asinh(__vector(float[2]));
pragma(mangle,"_Z5asinhDv3_f")  __vector(float[3])   asinh(__vector(float[3]));
pragma(mangle,"_Z5asinhDv4_f")  __vector(float[4])   asinh(__vector(float[4]));
pragma(mangle,"_Z5asinhDv8_f")  __vector(float[8])   asinh(__vector(float[8]));
pragma(mangle,"_Z5asinhDv16_f") __vector(float[16])  asinh(__vector(float[16]));
pragma(mangle,"_Z5asinhd")               double      asinh(         double);
pragma(mangle,"_Z5asinhDv2_d")  __vector(double[2])  asinh(__vector(double[2]));
pragma(mangle,"_Z5asinhDv3_d")  __vector(double[3])  asinh(__vector(double[3]));
pragma(mangle,"_Z5asinhDv4_d")  __vector(double[4])  asinh(__vector(double[4]));
pragma(mangle,"_Z5asinhDv8_d")  __vector(double[8])  asinh(__vector(double[8]));
pragma(mangle,"_Z5asinhDv16_d") __vector(double[16]) asinh(__vector(double[16]));

// asinpi
pragma(mangle,"_Z6asinpif")               float       asinpi(         float);
pragma(mangle,"_Z6asinpiDv2_f")  __vector(float[2])   asinpi(__vector(float[2]));
pragma(mangle,"_Z6asinpiDv3_f")  __vector(float[3])   asinpi(__vector(float[3]));
pragma(mangle,"_Z6asinpiDv4_f")  __vector(float[4])   asinpi(__vector(float[4]));
pragma(mangle,"_Z6asinpiDv8_f")  __vector(float[8])   asinpi(__vector(float[8]));
pragma(mangle,"_Z6asinpiDv16_f") __vector(float[16])  asinpi(__vector(float[16]));
pragma(mangle,"_Z6asinpid")               double      asinpi(         double);
pragma(mangle,"_Z6asinpiDv2_d")  __vector(double[2])  asinpi(__vector(double[2]));
pragma(mangle,"_Z6asinpiDv3_d")  __vector(double[3])  asinpi(__vector(double[3]));
pragma(mangle,"_Z6asinpiDv4_d")  __vector(double[4])  asinpi(__vector(double[4]));
pragma(mangle,"_Z6asinpiDv8_d")  __vector(double[8])  asinpi(__vector(double[8]));
pragma(mangle,"_Z6asinpiDv16_d") __vector(double[16]) asinpi(__vector(double[16]));

// atan
pragma(mangle,"_Z4atanf")               float       atan(         float);
pragma(mangle,"_Z4atanDv2_f")  __vector(float[2])   atan(__vector(float[2]));
pragma(mangle,"_Z4atanDv3_f")  __vector(float[3])   atan(__vector(float[3]));
pragma(mangle,"_Z4atanDv4_f")  __vector(float[4])   atan(__vector(float[4]));
pragma(mangle,"_Z4atanDv8_f")  __vector(float[8])   atan(__vector(float[8]));
pragma(mangle,"_Z4atanDv16_f") __vector(float[16])  atan(__vector(float[16]));
pragma(mangle,"_Z4atand")               double      atan(         double);
pragma(mangle,"_Z4atanDv2_d")  __vector(double[2])  atan(__vector(double[2]));
pragma(mangle,"_Z4atanDv3_d")  __vector(double[3])  atan(__vector(double[3]));
pragma(mangle,"_Z4atanDv4_d")  __vector(double[4])  atan(__vector(double[4]));
pragma(mangle,"_Z4atanDv8_d")  __vector(double[8])  atan(__vector(double[8]));
pragma(mangle,"_Z4atanDv16_d") __vector(double[16]) atan(__vector(double[16]));

// atan2
pragma(mangle,"_Z5atan2ff")                float       atan2(         float,                float);
pragma(mangle,"_Z5atan2Dv2_fS_")  __vector(float[2])   atan2(__vector(float[2]),   __vector(float[2]));
pragma(mangle,"_Z5atan2Dv3_fS_")  __vector(float[3])   atan2(__vector(float[3]),   __vector(float[3]));
pragma(mangle,"_Z5atan2Dv4_fS_")  __vector(float[4])   atan2(__vector(float[4]),   __vector(float[4]));
pragma(mangle,"_Z5atan2Dv8_fS_")  __vector(float[8])   atan2(__vector(float[8]),   __vector(float[8]));
pragma(mangle,"_Z5atan2Dv16_fS_") __vector(float[16])  atan2(__vector(float[16]),  __vector(float[16]));
pragma(mangle,"_Z5atan2dd")                double      atan2(         double,              double);
pragma(mangle,"_Z5atan2Dv2_dS_")  __vector(double[2])  atan2(__vector(double[2]),  __vector(double[2]));
pragma(mangle,"_Z5atan2Dv3_dS_")  __vector(double[3])  atan2(__vector(double[3]),  __vector(double[3]));
pragma(mangle,"_Z5atan2Dv4_dS_")  __vector(double[4])  atan2(__vector(double[4]),  __vector(double[4]));
pragma(mangle,"_Z5atan2Dv8_dS_")  __vector(double[8])  atan2(__vector(double[8]),  __vector(double[8]));
pragma(mangle,"_Z5atan2Dv16_dS_") __vector(double[16]) atan2(__vector(double[16]), __vector(double[16]));

// atanh
pragma(mangle,"_Z5atanhf")               float       atanh(         float);
pragma(mangle,"_Z5atanhDv2_f")  __vector(float[2])   atanh(__vector(float[2]));
pragma(mangle,"_Z5atanhDv3_f")  __vector(float[3])   atanh(__vector(float[3]));
pragma(mangle,"_Z5atanhDv4_f")  __vector(float[4])   atanh(__vector(float[4]));
pragma(mangle,"_Z5atanhDv8_f")  __vector(float[8])   atanh(__vector(float[8]));
pragma(mangle,"_Z5atanhDv16_f") __vector(float[16])  atanh(__vector(float[16]));
pragma(mangle,"_Z5atanhd")               double      atanh(         double);
pragma(mangle,"_Z5atanhDv2_d")  __vector(double[2])  atanh(__vector(double[2]));
pragma(mangle,"_Z5atanhDv3_d")  __vector(double[3])  atanh(__vector(double[3]));
pragma(mangle,"_Z5atanhDv4_d")  __vector(double[4])  atanh(__vector(double[4]));
pragma(mangle,"_Z5atanhDv8_d")  __vector(double[8])  atanh(__vector(double[8]));
pragma(mangle,"_Z5atanhDv16_d") __vector(double[16]) atanh(__vector(double[16]));

// atanpi
pragma(mangle,"_Z6atanpif")               float       atanpi(         float);
pragma(mangle,"_Z6atanpiDv2_f")  __vector(float[2])   atanpi(__vector(float[2]));
pragma(mangle,"_Z6atanpiDv3_f")  __vector(float[3])   atanpi(__vector(float[3]));
pragma(mangle,"_Z6atanpiDv4_f")  __vector(float[4])   atanpi(__vector(float[4]));
pragma(mangle,"_Z6atanpiDv8_f")  __vector(float[8])   atanpi(__vector(float[8]));
pragma(mangle,"_Z6atanpiDv16_f") __vector(float[16])  atanpi(__vector(float[16]));
pragma(mangle,"_Z6atanpid")               double      atanpi(         double);
pragma(mangle,"_Z6atanpiDv2_d")  __vector(double[2])  atanpi(__vector(double[2]));
pragma(mangle,"_Z6atanpiDv3_d")  __vector(double[3])  atanpi(__vector(double[3]));
pragma(mangle,"_Z6atanpiDv4_d")  __vector(double[4])  atanpi(__vector(double[4]));
pragma(mangle,"_Z6atanpiDv8_d")  __vector(double[8])  atanpi(__vector(double[8]));
pragma(mangle,"_Z6atanpiDv16_d") __vector(double[16]) atanpi(__vector(double[16]));

// atan2pi
pragma(mangle,"_Z7atan2piff")                float       atan2pi(         float,                float);
pragma(mangle,"_Z7atan2piDv2_fS_")  __vector(float[2])   atan2pi(__vector(float[2]),   __vector(float[2]));
pragma(mangle,"_Z7atan2piDv3_fS_")  __vector(float[3])   atan2pi(__vector(float[3]),   __vector(float[3]));
pragma(mangle,"_Z7atan2piDv4_fS_")  __vector(float[4])   atan2pi(__vector(float[4]),   __vector(float[4]));
pragma(mangle,"_Z7atan2piDv8_fS_")  __vector(float[8])   atan2pi(__vector(float[8]),   __vector(float[8]));
pragma(mangle,"_Z7atan2piDv16_fS_") __vector(float[16])  atan2pi(__vector(float[16]),  __vector(float[16]));
pragma(mangle,"_Z7atan2pidd")                double      atan2pi(         double,               double);
pragma(mangle,"_Z7atan2piDv2_dS_")  __vector(double[2])  atan2pi(__vector(double[2]),  __vector(double[2]));
pragma(mangle,"_Z7atan2piDv3_dS_")  __vector(double[3])  atan2pi(__vector(double[3]),  __vector(double[3]));
pragma(mangle,"_Z7atan2piDv4_dS_")  __vector(double[4])  atan2pi(__vector(double[4]),  __vector(double[4]));
pragma(mangle,"_Z7atan2piDv8_dS_")  __vector(double[8])  atan2pi(__vector(double[8]),  __vector(double[8]));
pragma(mangle,"_Z7atan2piDv16_dS_") __vector(double[16]) atan2pi(__vector(double[16]), __vector(double[16]));

// cbrt
pragma(mangle,"_Z4cbrtf")               float       cbrt(         float);
pragma(mangle,"_Z4cbrtDv2_f")  __vector(float[2])   cbrt(__vector(float[2]));
pragma(mangle,"_Z4cbrtDv3_f")  __vector(float[3])   cbrt(__vector(float[3]));
pragma(mangle,"_Z4cbrtDv4_f")  __vector(float[4])   cbrt(__vector(float[4]));
pragma(mangle,"_Z4cbrtDv8_f")  __vector(float[8])   cbrt(__vector(float[8]));
pragma(mangle,"_Z4cbrtDv16_f") __vector(float[16])  cbrt(__vector(float[16]));
pragma(mangle,"_Z4cbrtd")               double      cbrt(         double);
pragma(mangle,"_Z4cbrtDv2_d")  __vector(double[2])  cbrt(__vector(double[2]));
pragma(mangle,"_Z4cbrtDv3_d")  __vector(double[3])  cbrt(__vector(double[3]));
pragma(mangle,"_Z4cbrtDv4_d")  __vector(double[4])  cbrt(__vector(double[4]));
pragma(mangle,"_Z4cbrtDv8_d")  __vector(double[8])  cbrt(__vector(double[8]));
pragma(mangle,"_Z4cbrtDv16_d") __vector(double[16]) cbrt(__vector(double[16]));

// ceil
pragma(mangle,"_Z4ceilf")               float       ceil(         float);
pragma(mangle,"_Z4ceilDv2_f")  __vector(float[2])   ceil(__vector(float[2]));
pragma(mangle,"_Z4ceilDv3_f")  __vector(float[3])   ceil(__vector(float[3]));
pragma(mangle,"_Z4ceilDv4_f")  __vector(float[4])   ceil(__vector(float[4]));
pragma(mangle,"_Z4ceilDv8_f")  __vector(float[8])   ceil(__vector(float[8]));
pragma(mangle,"_Z4ceilDv16_f") __vector(float[16])  ceil(__vector(float[16]));
pragma(mangle,"_Z4ceild")               double      ceil(         double);
pragma(mangle,"_Z4ceilDv2_d")  __vector(double[2])  ceil(__vector(double[2]));
pragma(mangle,"_Z4ceilDv3_d")  __vector(double[3])  ceil(__vector(double[3]));
pragma(mangle,"_Z4ceilDv4_d")  __vector(double[4])  ceil(__vector(double[4]));
pragma(mangle,"_Z4ceilDv8_d")  __vector(double[8])  ceil(__vector(double[8]));
pragma(mangle,"_Z4ceilDv16_d") __vector(double[16]) ceil(__vector(double[16]));

// copysign
pragma(mangle,"_Z8copysignff")                float       copysign(         float,                float);
pragma(mangle,"_Z8copysignDv2_fS_")  __vector(float[2])   copysign(__vector(float[2]),   __vector(float[2]));
pragma(mangle,"_Z8copysignDv3_fS_")  __vector(float[3])   copysign(__vector(float[3]),   __vector(float[3]));
pragma(mangle,"_Z8copysignDv4_fS_")  __vector(float[4])   copysign(__vector(float[4]),   __vector(float[4]));
pragma(mangle,"_Z8copysignDv8_fS_")  __vector(float[8])   copysign(__vector(float[8]),   __vector(float[8]));
pragma(mangle,"_Z8copysignDv16_fS_") __vector(float[16])  copysign(__vector(float[16]),  __vector(float[16]));
pragma(mangle,"_Z8copysigndd")                double      copysign(         double,               double);
pragma(mangle,"_Z8copysignDv2_dS_")  __vector(double[2])  copysign(__vector(double[2]),  __vector(double[2]));
pragma(mangle,"_Z8copysignDv3_dS_")  __vector(double[3])  copysign(__vector(double[3]),  __vector(double[3]));
pragma(mangle,"_Z8copysignDv4_dS_")  __vector(double[4])  copysign(__vector(double[4]),  __vector(double[4]));
pragma(mangle,"_Z8copysignDv8_dS_")  __vector(double[8])  copysign(__vector(double[8]),  __vector(double[8]));
pragma(mangle,"_Z8copysignDv16_dS_") __vector(double[16]) copysign(__vector(double[16]), __vector(double[16]));

// cos
pragma(mangle,"_Z3cosf")               float       cos(         float);
pragma(mangle,"_Z3cosDv2_f")  __vector(float[2])   cos(__vector(float[2]));
pragma(mangle,"_Z3cosDv3_f")  __vector(float[3])   cos(__vector(float[3]));
pragma(mangle,"_Z3cosDv4_f")  __vector(float[4])   cos(__vector(float[4]));
pragma(mangle,"_Z3cosDv8_f")  __vector(float[8])   cos(__vector(float[8]));
pragma(mangle,"_Z3cosDv16_f") __vector(float[16])  cos(__vector(float[16]));
pragma(mangle,"_Z3cosd")               double      cos(         double);
pragma(mangle,"_Z3cosDv2_d")  __vector(double[2])  cos(__vector(double[2]));
pragma(mangle,"_Z3cosDv3_d")  __vector(double[3])  cos(__vector(double[3]));
pragma(mangle,"_Z3cosDv4_d")  __vector(double[4])  cos(__vector(double[4]));
pragma(mangle,"_Z3cosDv8_d")  __vector(double[8])  cos(__vector(double[8]));
pragma(mangle,"_Z3cosDv16_d") __vector(double[16]) cos(__vector(double[16]));

// cosh
pragma(mangle,"_Z4coshf")               float       cosh(         float);
pragma(mangle,"_Z4coshDv2_f")  __vector(float[2])   cosh(__vector(float[2]));
pragma(mangle,"_Z4coshDv3_f")  __vector(float[3])   cosh(__vector(float[3]));
pragma(mangle,"_Z4coshDv4_f")  __vector(float[4])   cosh(__vector(float[4]));
pragma(mangle,"_Z4coshDv8_f")  __vector(float[8])   cosh(__vector(float[8]));
pragma(mangle,"_Z4coshDv16_f") __vector(float[16])  cosh(__vector(float[16]));
pragma(mangle,"_Z4coshd")               double      cosh(         double);
pragma(mangle,"_Z4coshDv2_d")  __vector(double[2])  cosh(__vector(double[2]));
pragma(mangle,"_Z4coshDv3_d")  __vector(double[3])  cosh(__vector(double[3]));
pragma(mangle,"_Z4coshDv4_d")  __vector(double[4])  cosh(__vector(double[4]));
pragma(mangle,"_Z4coshDv8_d")  __vector(double[8])  cosh(__vector(double[8]));
pragma(mangle,"_Z4coshDv16_d") __vector(double[16]) cosh(__vector(double[16]));

// cospi
pragma(mangle,"_Z5cospif")               float       cospi(         float);
pragma(mangle,"_Z5cospiDv2_f")  __vector(float[2])   cospi(__vector(float[2]));
pragma(mangle,"_Z5cospiDv3_f")  __vector(float[3])   cospi(__vector(float[3]));
pragma(mangle,"_Z5cospiDv4_f")  __vector(float[4])   cospi(__vector(float[4]));
pragma(mangle,"_Z5cospiDv8_f")  __vector(float[8])   cospi(__vector(float[8]));
pragma(mangle,"_Z5cospiDv16_f") __vector(float[16])  cospi(__vector(float[16]));
pragma(mangle,"_Z5cospid")               double      cospi(         double);
pragma(mangle,"_Z5cospiDv2_d")  __vector(double[2])  cospi(__vector(double[2]));
pragma(mangle,"_Z5cospiDv3_d")  __vector(double[3])  cospi(__vector(double[3]));
pragma(mangle,"_Z5cospiDv4_d")  __vector(double[4])  cospi(__vector(double[4]));
pragma(mangle,"_Z5cospiDv8_d")  __vector(double[8])  cospi(__vector(double[8]));
pragma(mangle,"_Z5cospiDv16_d") __vector(double[16]) cospi(__vector(double[16]));

// erfc
pragma(mangle,"_Z4erfcf")               float       erfc(         float);
pragma(mangle,"_Z4erfcDv2_f")  __vector(float[2])   erfc(__vector(float[2]));
pragma(mangle,"_Z4erfcDv3_f")  __vector(float[3])   erfc(__vector(float[3]));
pragma(mangle,"_Z4erfcDv4_f")  __vector(float[4])   erfc(__vector(float[4]));
pragma(mangle,"_Z4erfcDv8_f")  __vector(float[8])   erfc(__vector(float[8]));
pragma(mangle,"_Z4erfcDv16_f") __vector(float[16])  erfc(__vector(float[16]));
pragma(mangle,"_Z4erfcd")               double      erfc(         double);
pragma(mangle,"_Z4erfcDv2_d")  __vector(double[2])  erfc(__vector(double[2]));
pragma(mangle,"_Z4erfcDv3_d")  __vector(double[3])  erfc(__vector(double[3]));
pragma(mangle,"_Z4erfcDv4_d")  __vector(double[4])  erfc(__vector(double[4]));
pragma(mangle,"_Z4erfcDv8_d")  __vector(double[8])  erfc(__vector(double[8]));
pragma(mangle,"_Z4erfcDv16_d") __vector(double[16]) erfc(__vector(double[16]));

// erf
pragma(mangle,"_Z3erff")               float       erf(         float);
pragma(mangle,"_Z3erfDv2_f")  __vector(float[2])   erf(__vector(float[2]));
pragma(mangle,"_Z3erfDv3_f")  __vector(float[3])   erf(__vector(float[3]));
pragma(mangle,"_Z3erfDv4_f")  __vector(float[4])   erf(__vector(float[4]));
pragma(mangle,"_Z3erfDv8_f")  __vector(float[8])   erf(__vector(float[8]));
pragma(mangle,"_Z3erfDv16_f") __vector(float[16])  erf(__vector(float[16]));
pragma(mangle,"_Z3erfd")               double      erf(         double);
pragma(mangle,"_Z3erfDv2_d")  __vector(double[2])  erf(__vector(double[2]));
pragma(mangle,"_Z3erfDv3_d")  __vector(double[3])  erf(__vector(double[3]));
pragma(mangle,"_Z3erfDv4_d")  __vector(double[4])  erf(__vector(double[4]));
pragma(mangle,"_Z3erfDv8_d")  __vector(double[8])  erf(__vector(double[8]));
pragma(mangle,"_Z3erfDv16_d") __vector(double[16]) erf(__vector(double[16]));

// exp
pragma(mangle,"_Z3expf")               float       exp(         float);
pragma(mangle,"_Z3expDv2_f")  __vector(float[2])   exp(__vector(float[2]));
pragma(mangle,"_Z3expDv3_f")  __vector(float[3])   exp(__vector(float[3]));
pragma(mangle,"_Z3expDv4_f")  __vector(float[4])   exp(__vector(float[4]));
pragma(mangle,"_Z3expDv8_f")  __vector(float[8])   exp(__vector(float[8]));
pragma(mangle,"_Z3expDv16_f") __vector(float[16])  exp(__vector(float[16]));
pragma(mangle,"_Z3expd")               double      exp(         double);
pragma(mangle,"_Z3expDv2_d")  __vector(double[2])  exp(__vector(double[2]));
pragma(mangle,"_Z3expDv3_d")  __vector(double[3])  exp(__vector(double[3]));
pragma(mangle,"_Z3expDv4_d")  __vector(double[4])  exp(__vector(double[4]));
pragma(mangle,"_Z3expDv8_d")  __vector(double[8])  exp(__vector(double[8]));
pragma(mangle,"_Z3expDv16_d") __vector(double[16]) exp(__vector(double[16]));

// exp2
pragma(mangle,"_Z4exp2f")               float       exp2(         float);
pragma(mangle,"_Z4exp2Dv2_f")  __vector(float[2])   exp2(__vector(float[2]));
pragma(mangle,"_Z4exp2Dv3_f")  __vector(float[3])   exp2(__vector(float[3]));
pragma(mangle,"_Z4exp2Dv4_f")  __vector(float[4])   exp2(__vector(float[4]));
pragma(mangle,"_Z4exp2Dv8_f")  __vector(float[8])   exp2(__vector(float[8]));
pragma(mangle,"_Z4exp2Dv16_f") __vector(float[16])  exp2(__vector(float[16]));
pragma(mangle,"_Z4exp2d")               double      exp2(         double);
pragma(mangle,"_Z4exp2Dv2_d")  __vector(double[2])  exp2(__vector(double[2]));
pragma(mangle,"_Z4exp2Dv3_d")  __vector(double[3])  exp2(__vector(double[3]));
pragma(mangle,"_Z4exp2Dv4_d")  __vector(double[4])  exp2(__vector(double[4]));
pragma(mangle,"_Z4exp2Dv8_d")  __vector(double[8])  exp2(__vector(double[8]));
pragma(mangle,"_Z4exp2Dv16_d") __vector(double[16]) exp2(__vector(double[16]));

// exp10
pragma(mangle,"_Z5exp10f")               float       exp10(         float);
pragma(mangle,"_Z5exp10Dv2_f")  __vector(float[2])   exp10(__vector(float[2]));
pragma(mangle,"_Z5exp10Dv3_f")  __vector(float[3])   exp10(__vector(float[3]));
pragma(mangle,"_Z5exp10Dv4_f")  __vector(float[4])   exp10(__vector(float[4]));
pragma(mangle,"_Z5exp10Dv8_f")  __vector(float[8])   exp10(__vector(float[8]));
pragma(mangle,"_Z5exp10Dv16_f") __vector(float[16])  exp10(__vector(float[16]));
pragma(mangle,"_Z5exp10d")               double      exp10(         double);
pragma(mangle,"_Z5exp10Dv2_d")  __vector(double[2])  exp10(__vector(double[2]));
pragma(mangle,"_Z5exp10Dv3_d")  __vector(double[3])  exp10(__vector(double[3]));
pragma(mangle,"_Z5exp10Dv4_d")  __vector(double[4])  exp10(__vector(double[4]));
pragma(mangle,"_Z5exp10Dv8_d")  __vector(double[8])  exp10(__vector(double[8]));
pragma(mangle,"_Z5exp10Dv16_d") __vector(double[16]) exp10(__vector(double[16]));

// expm1
pragma(mangle,"_Z5expm1f")               float       expm1(         float);
pragma(mangle,"_Z5expm1Dv2_f")  __vector(float[2])   expm1(__vector(float[2]));
pragma(mangle,"_Z5expm1Dv3_f")  __vector(float[3])   expm1(__vector(float[3]));
pragma(mangle,"_Z5expm1Dv4_f")  __vector(float[4])   expm1(__vector(float[4]));
pragma(mangle,"_Z5expm1Dv8_f")  __vector(float[8])   expm1(__vector(float[8]));
pragma(mangle,"_Z5expm1Dv16_f") __vector(float[16])  expm1(__vector(float[16]));
pragma(mangle,"_Z5expm1d")               double      expm1(         double);
pragma(mangle,"_Z5expm1Dv2_d")  __vector(double[2])  expm1(__vector(double[2]));
pragma(mangle,"_Z5expm1Dv3_d")  __vector(double[3])  expm1(__vector(double[3]));
pragma(mangle,"_Z5expm1Dv4_d")  __vector(double[4])  expm1(__vector(double[4]));
pragma(mangle,"_Z5expm1Dv8_d")  __vector(double[8])  expm1(__vector(double[8]));
pragma(mangle,"_Z5expm1Dv16_d") __vector(double[16]) expm1(__vector(double[16]));

// fabs
pragma(mangle,"_Z4fabsf")               float       fabs(         float);
pragma(mangle,"_Z4fabsDv2_f")  __vector(float[2])   fabs(__vector(float[2]));
pragma(mangle,"_Z4fabsDv3_f")  __vector(float[3])   fabs(__vector(float[3]));
pragma(mangle,"_Z4fabsDv4_f")  __vector(float[4])   fabs(__vector(float[4]));
pragma(mangle,"_Z4fabsDv8_f")  __vector(float[8])   fabs(__vector(float[8]));
pragma(mangle,"_Z4fabsDv16_f") __vector(float[16])  fabs(__vector(float[16]));
pragma(mangle,"_Z4fabsd")               double      fabs(         double);
pragma(mangle,"_Z4fabsDv2_d")  __vector(double[2])  fabs(__vector(double[2]));
pragma(mangle,"_Z4fabsDv3_d")  __vector(double[3])  fabs(__vector(double[3]));
pragma(mangle,"_Z4fabsDv4_d")  __vector(double[4])  fabs(__vector(double[4]));
pragma(mangle,"_Z4fabsDv8_d")  __vector(double[8])  fabs(__vector(double[8]));
pragma(mangle,"_Z4fabsDv16_d") __vector(double[16]) fabs(__vector(double[16]));

// fdim
pragma(mangle,"_Z4fdimff")                float       fdim(         float,                float);
pragma(mangle,"_Z4fdimDv2_fS_")  __vector(float[2])   fdim(__vector(float[2]),   __vector(float[2]));
pragma(mangle,"_Z4fdimDv3_fS_")  __vector(float[3])   fdim(__vector(float[3]),   __vector(float[3]));
pragma(mangle,"_Z4fdimDv4_fS_")  __vector(float[4])   fdim(__vector(float[4]),   __vector(float[4]));
pragma(mangle,"_Z4fdimDv8_fS_")  __vector(float[8])   fdim(__vector(float[8]),   __vector(float[8]));
pragma(mangle,"_Z4fdimDv16_fS_") __vector(float[16])  fdim(__vector(float[16]),  __vector(float[16]));
pragma(mangle,"_Z4fdimdd")                double      fdim(         double,               double);
pragma(mangle,"_Z4fdimDv2_dS_")  __vector(double[2])  fdim(__vector(double[2]),  __vector(double[2]));
pragma(mangle,"_Z4fdimDv3_dS_")  __vector(double[3])  fdim(__vector(double[3]),  __vector(double[3]));
pragma(mangle,"_Z4fdimDv4_dS_")  __vector(double[4])  fdim(__vector(double[4]),  __vector(double[4]));
pragma(mangle,"_Z4fdimDv8_dS_")  __vector(double[8])  fdim(__vector(double[8]),  __vector(double[8]));
pragma(mangle,"_Z4fdimDv16_dS_") __vector(double[16]) fdim(__vector(double[16]), __vector(double[16]));

// floor
pragma(mangle,"_Z5floorf")               float       floor(         float);
pragma(mangle,"_Z5floorDv2_f")  __vector(float[2])   floor(__vector(float[2]));
pragma(mangle,"_Z5floorDv3_f")  __vector(float[3])   floor(__vector(float[3]));
pragma(mangle,"_Z5floorDv4_f")  __vector(float[4])   floor(__vector(float[4]));
pragma(mangle,"_Z5floorDv8_f")  __vector(float[8])   floor(__vector(float[8]));
pragma(mangle,"_Z5floorDv16_f") __vector(float[16])  floor(__vector(float[16]));
pragma(mangle,"_Z5floord")               double      floor(         double);
pragma(mangle,"_Z5floorDv2_d")  __vector(double[2])  floor(__vector(double[2]));
pragma(mangle,"_Z5floorDv3_d")  __vector(double[3])  floor(__vector(double[3]));
pragma(mangle,"_Z5floorDv4_d")  __vector(double[4])  floor(__vector(double[4]));
pragma(mangle,"_Z5floorDv8_d")  __vector(double[8])  floor(__vector(double[8]));
pragma(mangle,"_Z5floorDv16_d") __vector(double[16]) floor(__vector(double[16]));

// fma
pragma(mangle,"_Z3fmafff")                 float       fma(         float,                float,                float);
pragma(mangle,"_Z3fmaDv2_fS_S_")  __vector(float[2])   fma(__vector(float[2]),   __vector(float[2]),   __vector(float[2]));
pragma(mangle,"_Z3fmaDv3_fS_S_")  __vector(float[3])   fma(__vector(float[3]),   __vector(float[3]),   __vector(float[3]));
pragma(mangle,"_Z3fmaDv4_fS_S_")  __vector(float[4])   fma(__vector(float[4]),   __vector(float[4]),   __vector(float[4]));
pragma(mangle,"_Z3fmaDv8_fS_S_")  __vector(float[8])   fma(__vector(float[8]),   __vector(float[8]),   __vector(float[8]));
pragma(mangle,"_Z3fmaDv16_fS_S_") __vector(float[16])  fma(__vector(float[16]),  __vector(float[16]),  __vector(float[16]));
pragma(mangle,"_Z3fmaddd")                 double      fma(         double,               double,               double);
pragma(mangle,"_Z3fmaDv2_dS_S_")  __vector(double[2])  fma(__vector(double[2]),  __vector(double[2]),  __vector(double[2]));
pragma(mangle,"_Z3fmaDv3_dS_S_")  __vector(double[3])  fma(__vector(double[3]),  __vector(double[3]),  __vector(double[3]));
pragma(mangle,"_Z3fmaDv4_dS_S_")  __vector(double[4])  fma(__vector(double[4]),  __vector(double[4]),  __vector(double[4]));
pragma(mangle,"_Z3fmaDv8_dS_S_")  __vector(double[8])  fma(__vector(double[8]),  __vector(double[8]),  __vector(double[8]));
pragma(mangle,"_Z3fmaDv16_dS_S_") __vector(double[16]) fma(__vector(double[16]), __vector(double[16]), __vector(double[16]));
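
/*
 * Usage sketch (illustrative only, not part of the original module).
 * A device function in a hypothetical kernel module named `mykernels`
 * could call the scalar float overload of fma declared above like so;
 * the module name and the lerp helper are assumptions made for the example:
 *
 *     @compute(CompileFor.deviceOnly) module mykernels;
 *     import ldc.dcompute;
 *     import dcompute.std.opencl.math : fma;
 *
 *     // Linear interpolation a + t * (b - a), written as one fused multiply-add.
 *     float lerp(float a, float b, float t)
 *     {
 *         return fma(t, b - a, a);
 *     }
 */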

// fmax
pragma(mangle,"_Z4fmaxff")                float       fmax(         float,                float);
pragma(mangle,"_Z4fmaxDv2_fS_")  __vector(float[2])   fmax(__vector(float[2]),   __vector(float[2]));
pragma(mangle,"_Z4fmaxDv3_fS_")  __vector(float[3])   fmax(__vector(float[3]),   __vector(float[3]));
pragma(mangle,"_Z4fmaxDv4_fS_")  __vector(float[4])   fmax(__vector(float[4]),   __vector(float[4]));
pragma(mangle,"_Z4fmaxDv8_fS_")  __vector(float[8])   fmax(__vector(float[8]),   __vector(float[8]));
pragma(mangle,"_Z4fmaxDv16_fS_") __vector(float[16])  fmax(__vector(float[16]),  __vector(float[16]));
pragma(mangle,"_Z4fmaxDv2_ff")   __vector(float[2])   fmax(__vector(float[2]),            float);
pragma(mangle,"_Z4fmaxDv3_ff")   __vector(float[3])   fmax(__vector(float[3]),            float);
pragma(mangle,"_Z4fmaxDv4_ff")   __vector(float[4])   fmax(__vector(float[4]),            float);
pragma(mangle,"_Z4fmaxDv8_ff")   __vector(float[8])   fmax(__vector(float[8]),            float);
pragma(mangle,"_Z4fmaxDv16_ff")  __vector(float[16])  fmax(__vector(float[16]),           float);
pragma(mangle,"_Z4fmaxdd")                double      fmax(         double,               double);
pragma(mangle,"_Z4fmaxDv2_dS_")  __vector(double[2])  fmax(__vector(double[2]),  __vector(double[2]));
pragma(mangle,"_Z4fmaxDv3_dS_")  __vector(double[3])  fmax(__vector(double[3]),  __vector(double[3]));
pragma(mangle,"_Z4fmaxDv4_dS_")  __vector(double[4])  fmax(__vector(double[4]),  __vector(double[4]));
pragma(mangle,"_Z4fmaxDv8_dS_")  __vector(double[8])  fmax(__vector(double[8]),  __vector(double[8]));
pragma(mangle,"_Z4fmaxDv16_dS_") __vector(double[16]) fmax(__vector(double[16]), __vector(double[16]));
pragma(mangle,"_Z4fmaxDv2_dd")   __vector(double[2])  fmax(__vector(double[2]),           double);
pragma(mangle,"_Z4fmaxDv3_dd")   __vector(double[3])  fmax(__vector(double[3]),           double);
pragma(mangle,"_Z4fmaxDv4_dd")   __vector(double[4])  fmax(__vector(double[4]),           double);
pragma(mangle,"_Z4fmaxDv8_dd")   __vector(double[8])  fmax(__vector(double[8]),           double);
pragma(mangle,"_Z4fmaxDv16_dd")  __vector(double[16]) fmax(__vector(double[16]),          double);

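// fmin: returns the smaller of the two arguments; if exactly one argument is NaN, the other is returned.
// The vector/scalar overloads apply the scalar second argument to every lane.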
pragma(mangle,"_Z4fminff")                float       fmin(         float,                float);
pragma(mangle,"_Z4fminDv2_fS_")  __vector(float[2])   fmin(__vector(float[2]),   __vector(float[2]));
pragma(mangle,"_Z4fminDv3_fS_")  __vector(float[3])   fmin(__vector(float[3]),   __vector(float[3]));
pragma(mangle,"_Z4fminDv4_fS_")  __vector(float[4])   fmin(__vector(float[4]),   __vector(float[4]));
pragma(mangle,"_Z4fminDv8_fS_")  __vector(float[8])   fmin(__vector(float[8]),   __vector(float[8]));
pragma(mangle,"_Z4fminDv16_fS_") __vector(float[16])  fmin(__vector(float[16]),  __vector(float[16]));
pragma(mangle,"_Z4fminDv2_ff")   __vector(float[2])   fmin(__vector(float[2]),            float);
pragma(mangle,"_Z4fminDv3_ff")   __vector(float[3])   fmin(__vector(float[3]),            float);
pragma(mangle,"_Z4fminDv4_ff")   __vector(float[4])   fmin(__vector(float[4]),            float);
pragma(mangle,"_Z4fminDv8_ff")   __vector(float[8])   fmin(__vector(float[8]),            float);
pragma(mangle,"_Z4fminDv16_ff")  __vector(float[16])  fmin(__vector(float[16]),           float);
pragma(mangle,"_Z4fmindd")                double      fmin(         double,               double);
pragma(mangle,"_Z4fminDv2_dS_")  __vector(double[2])  fmin(__vector(double[2]),  __vector(double[2]));
pragma(mangle,"_Z4fminDv3_dS_")  __vector(double[3])  fmin(__vector(double[3]),  __vector(double[3]));
pragma(mangle,"_Z4fminDv4_dS_")  __vector(double[4])  fmin(__vector(double[4]),  __vector(double[4]));
pragma(mangle,"_Z4fminDv8_dS_")  __vector(double[8])  fmin(__vector(double[8]),  __vector(double[8]));
pragma(mangle,"_Z4fminDv16_dS_") __vector(double[16]) fmin(__vector(double[16]), __vector(double[16]));
pragma(mangle,"_Z4fminDv2_dd")   __vector(double[2])  fmin(__vector(double[2]),           double);
pragma(mangle,"_Z4fminDv3_dd")   __vector(double[3])  fmin(__vector(double[3]),           double);
pragma(mangle,"_Z4fminDv4_dd")   __vector(double[4])  fmin(__vector(double[4]),           double);
pragma(mangle,"_Z4fminDv8_dd")   __vector(double[8])  fmin(__vector(double[8]),           double);
pragma(mangle,"_Z4fminDv16_dd")  __vector(double[16]) fmin(__vector(double[16]),          double);

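// fmod: floating-point remainder, x - y * trunc(x / y). Unlike fmax/fmin there are no vector/scalar overloads.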
pragma(mangle,"_Z4fmodff")                float       fmod(         float,                float);
pragma(mangle,"_Z4fmodDv2_fS_")  __vector(float[2])   fmod(__vector(float[2]),   __vector(float[2]));
pragma(mangle,"_Z4fmodDv3_fS_")  __vector(float[3])   fmod(__vector(float[3]),   __vector(float[3]));
pragma(mangle,"_Z4fmodDv4_fS_")  __vector(float[4])   fmod(__vector(float[4]),   __vector(float[4]));
pragma(mangle,"_Z4fmodDv8_fS_")  __vector(float[8])   fmod(__vector(float[8]),   __vector(float[8]));
pragma(mangle,"_Z4fmodDv16_fS_") __vector(float[16])  fmod(__vector(float[16]),  __vector(float[16]));
pragma(mangle,"_Z4fmoddd")                double      fmod(         double,               double);
pragma(mangle,"_Z4fmodDv2_dS_")  __vector(double[2])  fmod(__vector(double[2]),  __vector(double[2]));
pragma(mangle,"_Z4fmodDv3_dS_")  __vector(double[3])  fmod(__vector(double[3]),  __vector(double[3]));
pragma(mangle,"_Z4fmodDv4_dS_")  __vector(double[4])  fmod(__vector(double[4]),  __vector(double[4]));
pragma(mangle,"_Z4fmodDv8_dS_")  __vector(double[8])  fmod(__vector(double[8]),  __vector(double[8]));
pragma(mangle,"_Z4fmodDv16_dS_") __vector(double[16]) fmod(__vector(double[16]), __vector(double[16]));

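// fract: returns the fractional part x - floor(x), clamped to just below 1, and stores floor(x) through the pointer argument.
// One overload set per address space, encoded in the mangled name:
// AS4 = generic, AS1 = global, AS3 = shared (OpenCL "local"), no qualifier = private.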
pragma(mangle,"_Z5fractfPU3AS4f")
        float       fract(          float,       GenericPointer!(         float));
pragma(mangle,"_Z5fractDv2_fPU3AS4S_")
__vector(float[2])   fract(__vector(float[2]),   GenericPointer!(__vector(float[2])));
pragma(mangle,"_Z5fractDv3_fPU3AS4S_")
__vector(float[3])   fract(__vector(float[3]),   GenericPointer!(__vector(float[3])));
pragma(mangle,"_Z5fractDv4_fPU3AS4S_")
__vector(float[4])   fract(__vector(float[4]),   GenericPointer!(__vector(float[4])));
pragma(mangle,"_Z5fractDv8_fPU3AS4S_")
__vector(float[8])   fract(__vector(float[8]),   GenericPointer!(__vector(float[8])));
pragma(mangle,"_Z5fractDv16_fPU3AS4S_")
__vector(float[16])  fract(__vector(float[16]),  GenericPointer!(__vector(float[16])));
pragma(mangle,"_Z5fractdPU3AS4d")
        double      fract(          double,      GenericPointer!(         double));
pragma(mangle,"_Z5fractDv2_dPU3AS4S_")
__vector(double[2])  fract(__vector(double[2]),  GenericPointer!(__vector(double[2])));
pragma(mangle,"_Z5fractDv3_dPU3AS4S_")
__vector(double[3])  fract(__vector(double[3]),  GenericPointer!(__vector(double[3])));
pragma(mangle,"_Z5fractDv4_dPU3AS4S_")
__vector(double[4])  fract(__vector(double[4]),  GenericPointer!(__vector(double[4])));
pragma(mangle,"_Z5fractDv8_dPU3AS4S_")
__vector(double[8])  fract(__vector(double[8]),  GenericPointer!(__vector(double[8])));
pragma(mangle,"_Z5fractDv16_dPU3AS4S_")
__vector(double[16]) fract(__vector(double[16]), GenericPointer!(__vector(double[16])));
pragma(mangle,"_Z5fractfPU3AS1f")
        float       fract(          float,       GlobalPointer!(         float));
pragma(mangle,"_Z5fractDv2_fPU3AS1S_")
__vector(float[2])   fract(__vector(float[2]),   GlobalPointer!(__vector(float[2])));
pragma(mangle,"_Z5fractDv3_fPU3AS1S_")
__vector(float[3])   fract(__vector(float[3]),   GlobalPointer!(__vector(float[3])));
pragma(mangle,"_Z5fractDv4_fPU3AS1S_")
__vector(float[4])   fract(__vector(float[4]),   GlobalPointer!(__vector(float[4])));
pragma(mangle,"_Z5fractDv8_fPU3AS1S_")
__vector(float[8])   fract(__vector(float[8]),   GlobalPointer!(__vector(float[8])));
pragma(mangle,"_Z5fractDv16_fPU3AS1S_")
__vector(float[16])  fract(__vector(float[16]),  GlobalPointer!(__vector(float[16])));
pragma(mangle,"_Z5fractdPU3AS1d")
        double      fract(          double,      GlobalPointer!(         double));
pragma(mangle,"_Z5fractDv2_dPU3AS1S_")
__vector(double[2])  fract(__vector(double[2]),  GlobalPointer!(__vector(double[2])));
pragma(mangle,"_Z5fractDv3_dPU3AS1S_")
__vector(double[3])  fract(__vector(double[3]),  GlobalPointer!(__vector(double[3])));
pragma(mangle,"_Z5fractDv4_dPU3AS1S_")
__vector(double[4])  fract(__vector(double[4]),  GlobalPointer!(__vector(double[4])));
pragma(mangle,"_Z5fractDv8_dPU3AS1S_")
__vector(double[8])  fract(__vector(double[8]),  GlobalPointer!(__vector(double[8])));
pragma(mangle,"_Z5fractDv16_dPU3AS1S_")
__vector(double[16]) fract(__vector(double[16]), GlobalPointer!(__vector(double[16])));
pragma(mangle,"_Z5fractfPU3AS3f")
        float       fract(          float,       SharedPointer!(         float));
pragma(mangle,"_Z5fractDv2_fPU3AS3S_")
__vector(float[2])   fract(__vector(float[2]),   SharedPointer!(__vector(float[2])));
pragma(mangle,"_Z5fractDv3_fPU3AS3S_")
__vector(float[3])   fract(__vector(float[3]),   SharedPointer!(__vector(float[3])));
pragma(mangle,"_Z5fractDv4_fPU3AS3S_")
__vector(float[4])   fract(__vector(float[4]),   SharedPointer!(__vector(float[4])));
pragma(mangle,"_Z5fractDv8_fPU3AS3S_")
__vector(float[8])   fract(__vector(float[8]),   SharedPointer!(__vector(float[8])));
pragma(mangle,"_Z5fractDv16_fPU3AS3S_")
__vector(float[16])  fract(__vector(float[16]),  SharedPointer!(__vector(float[16])));
pragma(mangle,"_Z5fractdPU3AS3d")
        double      fract(          double,      SharedPointer!(         double));
pragma(mangle,"_Z5fractDv2_dPU3AS3S_")
__vector(double[2])  fract(__vector(double[2]),  SharedPointer!(__vector(double[2])));
pragma(mangle,"_Z5fractDv3_dPU3AS3S_")
__vector(double[3])  fract(__vector(double[3]),  SharedPointer!(__vector(double[3])));
pragma(mangle,"_Z5fractDv4_dPU3AS3S_")
__vector(double[4])  fract(__vector(double[4]),  SharedPointer!(__vector(double[4])));
pragma(mangle,"_Z5fractDv8_dPU3AS3S_")
__vector(double[8])  fract(__vector(double[8]),  SharedPointer!(__vector(double[8])));
pragma(mangle,"_Z5fractDv16_dPU3AS3S_")
__vector(double[16]) fract(__vector(double[16]), SharedPointer!(__vector(double[16])));
pragma(mangle,"_Z5fractfPf")
        float       fract(          float,       PrivatePointer!(         float));
pragma(mangle,"_Z5fractDv2_fPS_")
__vector(float[2])   fract(__vector(float[2]),   PrivatePointer!(__vector(float[2])));
pragma(mangle,"_Z5fractDv3_fPS_")
__vector(float[3])   fract(__vector(float[3]),   PrivatePointer!(__vector(float[3])));
pragma(mangle,"_Z5fractDv4_fPS_")
__vector(float[4])   fract(__vector(float[4]),   PrivatePointer!(__vector(float[4])));
pragma(mangle,"_Z5fractDv8_fPS_")
__vector(float[8])   fract(__vector(float[8]),   PrivatePointer!(__vector(float[8])));
pragma(mangle,"_Z5fractDv16_fPS_")
__vector(float[16])  fract(__vector(float[16]),  PrivatePointer!(__vector(float[16])));
pragma(mangle,"_Z5fractdPd")
        double      fract(          double,      PrivatePointer!(         double));
pragma(mangle,"_Z5fractDv2_dPS_")
__vector(double[2])  fract(__vector(double[2]),  PrivatePointer!(__vector(double[2])));
pragma(mangle,"_Z5fractDv3_dPS_")
__vector(double[3])  fract(__vector(double[3]),  PrivatePointer!(__vector(double[3])));
pragma(mangle,"_Z5fractDv4_dPS_")
__vector(double[4])  fract(__vector(double[4]),  PrivatePointer!(__vector(double[4])));
pragma(mangle,"_Z5fractDv8_dPS_")
__vector(double[8])  fract(__vector(double[8]),  PrivatePointer!(__vector(double[8])));
pragma(mangle,"_Z5fractDv16_dPS_")
__vector(double[16]) fract(__vector(double[16]), PrivatePointer!(__vector(double[16])));

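// frexp: splits x into a mantissa (the return value, magnitude in [0.5, 1) or 0) and an integer
// power-of-two exponent stored through the pointer argument, so that x == mantissa * 2^exponent.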
pragma(mangle,"_Z5frexpfPU3AS4i")
        float       frexp(          float,       GenericPointer!(         int));
pragma(mangle,"_Z5frexpDv2_fPU3AS4Dv2_i")
__vector(float[2])   frexp(__vector(float[2]),   GenericPointer!(__vector(int[2])));
pragma(mangle,"_Z5frexpDv3_fPU3AS4Dv3_i")
__vector(float[3])   frexp(__vector(float[3]),   GenericPointer!(__vector(int[3])));
pragma(mangle,"_Z5frexpDv4_fPU3AS4Dv4_i")
__vector(float[4])   frexp(__vector(float[4]),   GenericPointer!(__vector(int[4])));
pragma(mangle,"_Z5frexpDv8_fPU3AS4Dv8_i")
__vector(float[8])   frexp(__vector(float[8]),   GenericPointer!(__vector(int[8])));
pragma(mangle,"_Z5frexpDv16_fPU3AS4Dv16_i")
__vector(float[16])  frexp(__vector(float[16]),  GenericPointer!(__vector(int[16])));
pragma(mangle,"_Z5frexpdPU3AS4i")
        double      fre
Condensed preview: 66 files, each showing path, character count, and a content snippet.
[
  {
    "path": ".gitignore",
    "chars": 279,
    "preview": "# Compiled Object files\n*.o\n*.obj\n\n# Other outputs from LLVM\n*.ll\n*.spv\n*.spt\n*.ptx\n*.bc\n\n# Compiled Dynamic libraries\n*"
  },
  {
    "path": "LICENSE.txt",
    "chars": 1338,
    "preview": "Boost Software License - Version 1.0 - August 17th, 2003\n\nPermission is hereby granted, free of charge, to any person or"
  },
  {
    "path": "README.md",
    "chars": 5253,
    "preview": "# dcompute\n\n[![Latest version](https://img.shields.io/dub/v/dcompute.svg)](http://code.dlang.org/packages/dcompute)\n[![L"
  },
  {
    "path": "docs/00-prerequsites.md",
    "chars": 1250,
    "preview": "# Prerequisites\n\nIn order to use DCompute there are a few things you need before you start:\n\n* Capable hardware\n\n* Drive"
  },
  {
    "path": "docs/01-installation.md",
    "chars": 2340,
    "preview": "Installation\n============\n\nLDC\n---\n\nAs mentioned previously DCompute requires the use of LDC as the D compiler.\nAll [rec"
  },
  {
    "path": "docs/02-hardware.md",
    "chars": 1574,
    "preview": "Hardware\n========\n\nWriting code for DCompute kernels is a bit different from regular CPU programming.\n\nMost noticeable is"
  },
  {
    "path": "docs/03-kernels.md",
    "chars": 3766,
    "preview": "Kernels\n=======\n\nAt the heart of DCompute are the special attributes `@compute` and `@kernel()` from the module `ldc."
  },
  {
    "path": "docs/04-std/00-intro.md",
    "chars": 235,
    "preview": "The device standard library\n============================\n\nMuch like the regular standard library the device standard lib"
  },
  {
    "path": "docs/04-std/01-index.md",
    "chars": 2357,
    "preview": "Index\n=====\n\nTo do anything useful with DCompute a thread needs to know its index, its position.\nIf you take a look at"
  },
  {
    "path": "docs/05-driver/00-intro.md",
    "chars": 4099,
    "preview": "Driver\n======\n\nNow that you've successfully written your kernel, how do you execute it?\nThat's the job of the driver.\n\nT"
  },
  {
    "path": "docs/README.md",
    "chars": 3078,
    "preview": "## Welcome to the DCompute documentation!\n\nDCompute is a library that together with LDC is able to make D compile on GPU"
  },
  {
    "path": "dub.json",
    "chars": 1131,
    "preview": "{\n    \"name\": \"dcompute\",\n    \"description\": \"Native Heterogeneous Computing for D\",\n    \"copyright\": \"Copyright © 2017,"
  },
  {
    "path": "source/dcompute/driver/README.md",
    "chars": 2129,
    "preview": "dcompute.driver\n===============\n\nContains the abstracted driver interface for dcompute. It contains a delegation \nlayer "
  },
  {
    "path": "source/dcompute/driver/backend.d",
    "chars": 78,
    "preview": "module dcompute.driver.backend;\n\nenum Backend\n{\n    OpenCL120,\n    CUDA650,\n}\n"
  },
  {
    "path": "source/dcompute/driver/cuda/TODO",
    "chars": 63,
    "preview": "cuLink.*\ncuIpc.*\ncuTexRef.*\ncuTexObj.*\ncuSurfRef.*\ncuSurfObj.*\n"
  },
  {
    "path": "source/dcompute/driver/cuda/buffer.d",
    "chars": 1070,
    "preview": "module dcompute.driver.cuda.buffer;\n\nimport dcompute.driver.cuda;\n\nstruct Buffer(T)\n{\n    size_t raw;\n\n\t// Host memory a"
  },
  {
    "path": "source/dcompute/driver/cuda/context.d",
    "chars": 2406,
    "preview": "module dcompute.driver.cuda.context;\n\nimport dcompute.driver.cuda;\n\nstruct Context\n{\n    CUcontext raw;\n    this(Device "
  },
  {
    "path": "source/dcompute/driver/cuda/device.d",
    "chars": 5074,
    "preview": "module dcompute.driver.cuda.device;\n\nimport dcompute.driver.cuda;\n\nstruct Device\n{\n    int raw;\n    //struct CUdevprop\n "
  },
  {
    "path": "source/dcompute/driver/cuda/event.d",
    "chars": 105,
    "preview": "module dcompute.driver.cuda.event;\n\nimport dcompute.driver.cuda;\n\nstruct Event\n{\n    CUevent raw;\n    \n}\n"
  },
  {
    "path": "source/dcompute/driver/cuda/kernel.d",
    "chars": 457,
    "preview": "module dcompute.driver.cuda.kernel;\n\nimport dcompute.driver.cuda;\nstruct Kernel(F) if (is(F==function)|| is(F==void))\n{\n"
  },
  {
    "path": "source/dcompute/driver/cuda/memory.d",
    "chars": 1131,
    "preview": "module dcompute.driver.cuda.memory;\n\nimport dcompute.driver.error;\nimport dcompute.driver.cuda;\n\n// void pointer like\nst"
  },
  {
    "path": "source/dcompute/driver/cuda/package.d",
    "chars": 1134,
    "preview": "module dcompute.driver.cuda;\n\npublic import ldc.dcompute;\npublic import bindbc.cuda;\n\npublic import dcompute.driver.erro"
  },
  {
    "path": "source/dcompute/driver/cuda/platform.d",
    "chars": 972,
    "preview": "module dcompute.driver.cuda.platform;\n\nimport dcompute.driver.error;\nimport dcompute.driver.cuda;\nimport std.experimenta"
  },
  {
    "path": "source/dcompute/driver/cuda/program.d",
    "chars": 1227,
    "preview": "module dcompute.driver.cuda.program;\n\nimport dcompute.driver.cuda;\n\nimport std.string;\nstruct Program\n{\n    CUmodule raw"
  },
  {
    "path": "source/dcompute/driver/cuda/queue.d",
    "chars": 2763,
    "preview": "// A stream in CUDA speak\nmodule dcompute.driver.cuda.queue;\n\nimport dcompute.driver.cuda;\nstruct Queue\n{\n    CUstream r"
  },
  {
    "path": "source/dcompute/driver/cuda/unified_buffer.d",
    "chars": 5259,
    "preview": "/**\n * Unified Memory (Managed Memory) buffer for CUDA.\n *\n * A UnifiedBuffer!T allocates memory that is accessible from"
  },
  {
    "path": "source/dcompute/driver/error.d",
    "chars": 7661,
    "preview": "/**/\n\nmodule dcompute.driver.error;\n\n// Helpfully OpenCL errors are negative and CUDAs are positive\nenum Status : int {\n"
  },
  {
    "path": "source/dcompute/driver/ocl/buffer.d",
    "chars": 259,
    "preview": "module dcompute.driver.ocl.buffer;\n\nimport dcompute.driver.ocl;\n\nstruct Buffer(T)\n{\n    cl_mem raw;\n\n    // Host memory "
  },
  {
    "path": "source/dcompute/driver/ocl/context.d",
    "chars": 4817,
    "preview": "module dcompute.driver.ocl.context;\n\nimport dcompute.driver.ocl;\nimport std.typecons;\n\nimport std.experimental.allocator"
  },
  {
    "path": "source/dcompute/driver/ocl/device.d",
    "chars": 6524,
    "preview": "module dcompute.driver.ocl.device;\n\nimport derelict.opencl.cl;\nimport dcompute.driver.ocl;\nimport std.meta: AliasSeq;\n\ns"
  },
  {
    "path": "source/dcompute/driver/ocl/event.d",
    "chars": 2242,
    "preview": "module dcompute.driver.ocl.event;\n\nimport dcompute.driver.ocl;\n\nstruct Event\n{\n    cl_event raw;\n    enum EnqueuedComman"
  },
  {
    "path": "source/dcompute/driver/ocl/image.d",
    "chars": 1876,
    "preview": "module dcompute.driver.ocl.image;\n\nimport dcompute.driver.ocl;\nstruct Image\n{\n    cl_mem raw;\n    \n    enum ChannelOrder"
  },
  {
    "path": "source/dcompute/driver/ocl/kernel.d",
    "chars": 2542,
    "preview": "module dcompute.driver.ocl.kernel;\n\nimport dcompute.driver.ocl;\n\nstruct Kernel(F) if (is(F == function) || is(F==void))\n"
  },
  {
    "path": "source/dcompute/driver/ocl/memory.d",
    "chars": 1629,
    "preview": "module dcompute.driver.ocl.memory;\n\nimport dcompute.driver.ocl;\n\nstruct Memory\n{\n    enum Type\n    {\n        buffer     "
  },
  {
    "path": "source/dcompute/driver/ocl/package.d",
    "chars": 610,
    "preview": "module dcompute.driver.ocl;\n\npublic import dcompute.driver.error;\n\npublic import dcompute.driver.ocl.buffer;\npublic impo"
  },
  {
    "path": "source/dcompute/driver/ocl/platform.d",
    "chars": 1984,
    "preview": "module dcompute.driver.ocl.platform;\n\nimport dcompute.driver.ocl;\nimport std.experimental.allocator.typed;\nimport std.me"
  },
  {
    "path": "source/dcompute/driver/ocl/program.d",
    "chars": 2434,
    "preview": "module dcompute.driver.ocl.program;\n\nimport dcompute.driver.ocl;\nimport std.meta: AliasSeq;\nimport std.string : toString"
  },
  {
    "path": "source/dcompute/driver/ocl/queue.d",
    "chars": 4139,
    "preview": "module dcompute.driver.ocl.queue;\n\nimport dcompute.driver.ocl;\nimport dcompute.driver.util;\nimport std.typecons;\n\nenum M"
  },
  {
    "path": "source/dcompute/driver/ocl/raw/enums.d",
    "chars": 3170,
    "preview": "module dcompute.driver.ocl.raw.enums;\n\nimport dcompute.driver.ocl;\n\nenum //: profiling_info\n{\n    PROFILING_COMMAND_QUEU"
  },
  {
    "path": "source/dcompute/driver/ocl/raw/functions.d",
    "chars": 14913,
    "preview": "module dcompute.driver.ocl.raw.functions;\n\n//This is an autogenerated file, do not edit\n\n\nimport dcompute.driver.ocl;\n//"
  },
  {
    "path": "source/dcompute/driver/ocl/raw/package.d",
    "chars": 161,
    "preview": "module dcompute.driver.ocl.raw;\n\npublic import dcompute.driver.ocl.raw.functions;\npublic import dcompute.driver.ocl.raw."
  },
  {
    "path": "source/dcompute/driver/ocl/sampler.d",
    "chars": 955,
    "preview": "module dcompute.driver.ocl.sampler;\n\nimport dcompute.driver.ocl;\nstruct Sampler\n{\n    enum FilterMode\n    {\n        near"
  },
  {
    "path": "source/dcompute/driver/ocl/util.d",
    "chars": 5224,
    "preview": "module dcompute.driver.ocl.util;\n\nimport std.range;\nimport std.meta;\nimport std.traits;\n\n//deal with arrays separately, "
  },
  {
    "path": "source/dcompute/driver/util.d",
    "chars": 484,
    "preview": "module dcompute.driver.util;\n\nimport std.traits;\nimport std.meta;\nimport ldc.dcompute : Pointer;\nimport dcompute.driver."
  },
  {
    "path": "source/dcompute/kernels/README.md",
    "chars": 875,
    "preview": "Algorithms\n==========\n\nAdjacent\n\nExample use\n===========\n\nIdeally we want to be able to do something like\n```D\nwith(kern"
  },
  {
    "path": "source/dcompute/kernels/package.d",
    "chars": 194,
    "preview": "module dcompute.kernels;\n/*Adjacent:\n * adjacent!(R,alias e)(R r, R o) where e a is binary op to apply to adjacent eleme"
  },
  {
    "path": "source/dcompute/std/atomic.d",
    "chars": 1689,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.atomic;\n\nimport ldc.dcompute;\n\nimport cuda = dcompute.std.cuda.atomi"
  },
  {
    "path": "source/dcompute/std/atomic_common.d",
    "chars": 164,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.atomic_common;\n\nimport ldc.dcompute;\n\nenum MemoryOrder {\n\trelaxed, \n"
  },
  {
    "path": "source/dcompute/std/cuda/atomic.d",
    "chars": 5938,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.cuda.atomic;\n\nimport ldc.dcompute;\nimport dcompute.std.atomic_common"
  },
  {
    "path": "source/dcompute/std/cuda/index.d",
    "chars": 1145,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.cuda.index;\n\nimport ldc.dcompute;\npure: nothrow: @nogc:\n//tid = thre"
  },
  {
    "path": "source/dcompute/std/cuda/sync.d",
    "chars": 790,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.cuda.sync;\n\nimport ldc.dcompute;\nimport ldc.intrinsics;\n\npragma(LDC_"
  },
  {
    "path": "source/dcompute/std/floating.d",
    "chars": 530,
    "preview": "@compute(CompileFor.hostAndDevice) module dcompute.std.floating;\n\nimport ldc.dcompute;\n\n/*\n *Intrinsic\n * isfinite\n * is"
  },
  {
    "path": "source/dcompute/std/index.d",
    "chars": 9689,
    "preview": "@compute(CompileFor.hostAndDevice) module dcompute.std.index;\n\nimport ldc.dcompute;\n\nprivate import ocl  = dcompute.std."
  },
  {
    "path": "source/dcompute/std/integer.d",
    "chars": 543,
    "preview": "@compute(CompileFor.hostAndDevice) module dcompute.std.integer;\n\nimport ldc.dcompute;\n\n/*\n brev - bit reverse\n sad  - su"
  },
  {
    "path": "source/dcompute/std/memory.d",
    "chars": 680,
    "preview": "@compute(CompileFor.hostAndDevice) module dcompute.std.memory;\n\nimport ldc.dcompute;\n\n/*\n *Pointer conversions:\n * *Poin"
  },
  {
    "path": "source/dcompute/std/opencl/image.d",
    "chars": 11183,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.opencl.image;\n\nimport ldc.dcompute;\n//separate module for opaque ima"
  },
  {
    "path": "source/dcompute/std/opencl/index.d",
    "chars": 3497,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.opencl.index;\n\nimport ldc.dcompute;\n\npure:\nnothrow:\n@nogc:\n\n// These"
  },
  {
    "path": "source/dcompute/std/opencl/math.d",
    "chars": 123629,
    "preview": "/**\nProvides access to the OpenCL C math functions and constants.\n\nThese functions are only callable from opencl kernels"
  },
  {
    "path": "source/dcompute/std/opencl/sync.d",
    "chars": 4553,
    "preview": "/++\nProvides access to the OpenCL C sync functions.\nSee_Also: [6.15.8. Synchronization Functions](https://registry.khron"
  },
  {
    "path": "source/dcompute/std/pack.d",
    "chars": 1033,
    "preview": "@compute(CompileFor.hostAndDevice) module dcompute.std.pack;\n\nimport ldc.dcompute;\n//Unpacking functions\n/*\nfloat4 unorm"
  },
  {
    "path": "source/dcompute/std/package.d",
    "chars": 162,
    "preview": "module dcompute.std;\n\nversion(LDC_DCompute) {}\nelse\n{\n    static assert(false, \"Need to use a DCompute enabled compiler."
  },
  {
    "path": "source/dcompute/std/sync.d",
    "chars": 1029,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.sync;\n\nimport ldc.dcompute;\nimport ldc.intrinsics;\n\nimport ocl  = dc"
  },
  {
    "path": "source/dcompute/std/warp.d",
    "chars": 587,
    "preview": "@compute(CompileFor.deviceOnly) module dcompute.std.warp;\n\nimport ldc.dcompute;\n/*Warp functions\n *Vote:\n * int  any(int"
  },
  {
    "path": "source/dcompute/tests/dummykernels.d",
    "chars": 595,
    "preview": "@compute(CompileFor.deviceOnly)\nmodule dcompute.tests.dummykernels;\npragma(LDC_no_moduleinfo);\n\nimport ldc.dcompute;\nimp"
  },
  {
    "path": "source/dcompute/tests/main.d",
    "chars": 6069,
    "preview": "version (DComputeTesting) {\n    version = DComputeTestCUDA;\n}\n\n//import dcompute.tests.test;\n\nimport std.algorithm;\nimpo"
  },
  {
    "path": "source/dcompute/tests/test.d",
    "chars": 247,
    "preview": "@compute(CompileFor.deviceOnly)\nmodule dcompute.tests.test;\n\nimport ldc.dcompute;\nimport dcompute.std.index;\nimport std."
  }
]

