[
  {
    "path": ".gitignore",
    "content": "bin/\nobj/\n"
  },
  {
    "path": "LICENSE",
    "content": "The MIT License (MIT)\n\nCopyright (c) 2014 SURFsara\n\nPermission is hereby granted, free of charge, to any person obtaining a copy\nof this software and associated documentation files (the \"Software\"), to deal\nin the Software without restriction, including without limitation the rights\nto use, copy, modify, merge, publish, distribute, sublicense, and/or sell\ncopies of the Software, and to permit persons to whom the Software is\nfurnished to do so, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in\nall copies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN\nTHE SOFTWARE.\n"
  },
  {
    "path": "Makefile",
    "content": "\n# ==================================================================================================\n# Project: \n# Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n#\n# File information:\n# Institution.... SURFsara <www.surfsara.nl>\n# Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n# Changed at..... 2014-11-07\n# License........ MIT license\n# Tab-size....... 4 spaces\n# Line length.... 100 characters\n#\n# ==================================================================================================\n\n# Set the location of CUDA, OpenCL and clBlas\nCUDADIR = $(CUDA_HOME)\nOPENCLDIR = $(CUDA_HOME)\nCLBLASDIR = $(CLBLAS_HOME)\n\n# Disable all CUDA components (including cuBLAS) in the code to run on a non-NVIDIA system\nENABLE_CUDA = 1\n\n# ==================================================================================================\n\n# Compilers\nCXX = g++\nNVCC = nvcc\n\n# Compiler flags\nCXXFLAGS += -O3 -Wall\nNVFLAGS += -O3 -arch=sm_35 -Xcompiler -Wall\n#NVFLAGS += -maxrregcount 127\n\n# Folders\nSRCDIR = src\nBINDIR = bin\nOBJDIR = obj\nSCRDIR = scripts\n\n# Disable/enable CUDA in the C++ code\nifeq ($(ENABLE_CUDA),1)\n\tDEFINES += -DENABLE_CUDA\nendif\n\n# Load OpenCL and the clBlas library\nINCLUDES += -I$(OPENCLDIR)/include -I$(CLBLASDIR)/include\nLDFLAGS += -L$(OPENCLDIR)/lib64 -L$(CLBLASDIR)/lib64\nLDFLAGS += -lOpenCL -lclBLAS\n\n# Load CUDA and the cuBLAS library\nifeq ($(ENABLE_CUDA),1)\n\tINCLUDES += -I$(CUDADIR)/include\n\tLDFLAGS += -L$(CUDADIR)/lib64\n\tLDFLAGS += -lcuda -lcudart -lcublas\nendif\n\n# Set the source files\nCPPSOURCES = main.cpp clGEMM.cpp libclblas.cpp\nGPUSOURCES = cuGEMM.cu libcublas.cu\n\n# Define the names of the object files and the binary\nOBJS = $(CPPSOURCES:%.cpp=$(OBJDIR)/%.cpp.o)\nifeq ($(ENABLE_CUDA),1)\n\tOBJS +=  $(GPUSOURCES:%.cu=$(OBJDIR)/%.cu.o)\nendif\nBIN = $(BINDIR)/myGEMM\n\n# 
==================================================================================================\n\n# All (default target)\nall: build run\n\n# Build the binary from the objects\nbuild: $(OBJS)\n\t@mkdir -p $(BINDIR)\n\t$(CXX) $(CXXFLAGS) $(DEFINES) $(INCLUDES) $(OBJS) $(LDFLAGS) -o $(BIN)\n\n# C++ sources\n$(OBJDIR)/%.cpp.o: $(SRCDIR)/%.cpp $(SRCDIR)/*.h\n\t@mkdir -p $(OBJDIR)\n\t$(CXX) -c $(CXXFLAGS) $(DEFINES) $(INCLUDES) $< -o $@\n\n# CUDA sources\n$(OBJDIR)/%.cu.o: $(SRCDIR)/%.cu $(SRCDIR)/*.h $(SRCDIR)/*.cl\n\t@mkdir -p $(OBJDIR)\n\t$(NVCC) -c $(NVFLAGS) $(DEFINES) $(INCLUDES) $< -o $@\n\n# Generate assembly code from the kernels and print some statistics\ninspect:\n\t$(NVCC) -cubin $(NVFLAGS) -Xptxas -v $(INCLUDES) $(SRCDIR)/cuGEMM.cu -o $(BIN).cu.cubin\n\tnvdisasm -lrm narrow $(BIN).cu.cubin > $(BIN).cu.asm\n\tcuobjdump $(BIN) -xptx cuGEMM\n\tmv cuGEMM.sm_35.ptx $(BIN).cu.ptx\n\tcuobjdump $(BIN) -sass > $(BIN).cu.sass\n\tsh $(SCRDIR)/stats.sh $(BIN).cu.sass\n\n# Execute the binary\nrun:\n\t./$(BIN)\n\n# Clean-up\nclean:\n\trm -f $(OBJDIR)/*.o\n\trm -f $(BIN)\n\trm -f $(BIN).*\n\n# ==================================================================================================\n\n.PHONY: all build run inspect clean\n\n# ==================================================================================================\n"
  },
  {
    "path": "README.md",
    "content": "\nExploring the performance of SGEMM in OpenCL on NVIDIA GPUs\n=============\n\nDate: 31-Oct-2014 - 07-Nov-2014\n\nAuthor: Cedric Nugteren, SURFsara (http://www.surfsara.nl)\n\nThis repository contains multiple OpenCL implementations of single-precision generalised matrix-multiplication (SGEMM) tuned for an NVIDIA Tesla K40m GPU. The different versions (named myGEMM) are part of a step-by-step tutorial, in which each step adds a new optimisation. The different steps and the details of the OpenCL kernel codes are all explained in depth at https://cnugteren.github.io/tutorial/pages/page1.html.\n\nThe OpenCL kernels can be used natively using the OpenCL framework. However, there is also a header-file included which converts the OpenCL kernels into CUDA syntax. This allows the same code to be tested through the CUDA-toolchain.\n\nApart from the OpenCL kernel codes, this repository contains fully working host code, including a loop over different matrix sizes and different BLAS libraries. It contains code to run NVIDIA's cuBLAS as a reference and the open-source clBlas library.\n\nPre-requisites:\n* A C++ compiler (tested with GCC and ICC)\n* The CUDA toolkit and NVCC compiler (tested with version 6.5)\n* OpenCL headers and libraries (part of the CUDA toolkit)\n\nRequirements to run the performance and correctness comparisons:\n* The cuBLAS library (part of the CUDA toolkit, tested version 6.5)\n* The open-source clBlas library (tested 2.2.0)\n\nUsage\n=============\n\n*\tCompile the code:\n\n\t\tmake build\n\n\tCompiles the benchmarking infrastructure and the myGEMM kernels. Make sure there is a \"bin\" and \"obj\" directory available. Note that you might have to edit the Makefile to set the proper locations of the CUDA and OpenCL installations on your system.\n\n*\tRun the code:\n\n\t\tmake run\n\n\tThis runs the code for matrices ranging from MINSIZE to MAXSIZE (defined in src/common.h). 
It will run cuBLAS, clBlas, and the CUDA and OpenCL versions of the myGEMM kernels. The particular kernel to be executed is selected via the KERNEL keyword in src/settings.h. This file also contains other settings you might want to modify for your particular GPU.\n\n*\tInspect the code:\n\n\t\tmake inspect\n\n\tThis generates several assembly-like versions of the CUDA kernels in the \"bin\" subdirectory. It also prints statistics of the kernels, such as register usage.\n\nMinimal working example\n=============\n\nAdditionally, we supply the minimal.cpp file in the 'extra' directory. This file is a self-contained minimal working example (MWE) of the most basic SGEMM kernel (myGEMM1). This can be useful if you don't want to deal with Makefiles or don't have CUDA, cuBLAS, or clBlas installed. Note that minimal.cpp lacks some features compared to the main code, but we believe it can nevertheless be a good starting point if you want to integrate myGEMM into your own code.\n\nThe code can be compiled using a regular C++ compiler and only requires an OpenCL installation. Example compilation from the root folder:\n\n\tg++ -O3 -Wall -I/path/to/opencl/include extra/minimal.cpp -o bin/minimal -lOpenCL\n\nBe aware that the minimal working example does not:\n*\tIterate over multiple matrix sizes\n*\tCompare performance with cuBLAS or clBlas\n*\tCheck for correctness of the results\n*\tCheck for OpenCL errors\n*\tLoad a kernel file from disk; the kernel is embedded as a string instead\n"
  },
  {
    "path": "extra/minimal.cpp",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-07\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// Compilation example:\n// g++ -O3 -I$OPENCL_DIR/include minimal.cpp -o minimal -lOpenCL\n// \n// =================================================================================================\n\n// Includes\n#include <stdio.h>\n#include <sys/time.h>\n#include <CL/cl.h>\n\n// =================================================================================================\n\n// Repeat all kernels multiple times to get an average timing result\n#define NUM_RUNS 2\n\n// Size of the matrices - K, M, N (squared)\n#define SIZE 4096\n\n// Threadblock sizes (e.g. 
for kernels myGEMM1 or myGEMM2)\n#define TS 32\n\n// =================================================================================================\n\n// Set the kernel as a string (better to do this in a separate file though)\nconst char *kernelstring =\n    \"__kernel void myGEMM1(const int M, const int N, const int K,\"\n    \"                      const __global float* A,\"\n    \"                      const __global float* B,\"\n    \"                      __global float* C) {\"\n    \"    const int globalRow = get_global_id(0);\"\n    \"    const int globalCol = get_global_id(1);\"\n    \"    float acc = 0.0f;\"\n    \"    for (int k=0; k<K; k++) {\"\n    \"        acc += A[k*M + globalRow] * B[globalCol*K + k];\"\n    \"    }\"\n    \"    C[globalCol*M + globalRow] = acc;\"\n    \"}\";\n\n// =================================================================================================\n\n// Matrix-multiplication using a custom OpenCL SGEMM kernel.\nint main(int argc, char* argv[]) {\n\n    // Timers\n    struct timeval Tvalue;\n    struct timezone dummy;\n\n    // Set the sizes\n    int K = SIZE;\n    int M = SIZE;\n    int N = SIZE;\n\n    // Create the matrices and initialize them with example values\n    float* A = (float*)malloc(M*K*sizeof(float));\n    float* B = (float*)malloc(K*N*sizeof(float));\n    float* C = (float*)malloc(M*N*sizeof(float));\n    for (int i=0; i<M*K; i++) { A[i] = 3.6*i + i*i + 3.1; }\n    for (int i=0; i<K*N; i++) { B[i] = 1.2*i + 0.01*i*i + 13.9; }\n    for (int i=0; i<M*N; i++) { C[i] = 0.0; }\n\n    // Configure the OpenCL environment\n    printf(\">>> Initializing OpenCL...\\n\");\n    cl_platform_id platform = 0;\n    clGetPlatformIDs(1, &platform, NULL);\n    cl_device_id device = 0;\n    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);\n    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);\n    cl_command_queue queue = clCreateCommandQueue(context, device, 0, NULL);\n    char 
deviceName[1024];\n    clGetDeviceInfo(device, CL_DEVICE_NAME, 1024, deviceName, NULL);\n    cl_event event = NULL;\n\n    // Compile the kernel\n    cl_program program = clCreateProgramWithSource(context, 1, &kernelstring, NULL, NULL);\n    clBuildProgram(program, 0, NULL, \"\", NULL, NULL);\n\n    // Check for compilation errors\n    size_t logSize;\n    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);\n    char* messages = (char*)malloc((1+logSize)*sizeof(char));\n    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logSize, messages, NULL);\n    messages[logSize] = '\\0';\n    if (logSize > 10) { printf(\">>> Compiler message: %s\\n\", messages); }\n    free(messages);\n\n    // Prepare OpenCL memory objects\n    cl_mem bufA = clCreateBuffer(context, CL_MEM_READ_ONLY,  M*K*sizeof(float), NULL, NULL);\n    cl_mem bufB = clCreateBuffer(context, CL_MEM_READ_ONLY,  K*N*sizeof(float), NULL, NULL);\n    cl_mem bufC = clCreateBuffer(context, CL_MEM_READ_WRITE, M*N*sizeof(float), NULL, NULL);\n\n    // Copy matrices to the GPU\n    clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, M*K*sizeof(float), A, 0, NULL, NULL);\n    clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, K*N*sizeof(float), B, 0, NULL, NULL);\n    clEnqueueWriteBuffer(queue, bufC, CL_TRUE, 0, M*N*sizeof(float), C, 0, NULL, NULL);\n\n    // Configure the myGEMM kernel and set its arguments\n    cl_kernel kernel = clCreateKernel(program, \"myGEMM1\", NULL);\n    clSetKernelArg(kernel, 0, sizeof(int), (void*)&M);\n    clSetKernelArg(kernel, 1, sizeof(int), (void*)&N);\n    clSetKernelArg(kernel, 2, sizeof(int), (void*)&K);\n    clSetKernelArg(kernel, 3, sizeof(cl_mem), (void*)&bufA);\n    clSetKernelArg(kernel, 4, sizeof(cl_mem), (void*)&bufB);\n    clSetKernelArg(kernel, 5, sizeof(cl_mem), (void*)&bufC);\n\n    // Start the timed loop\n    printf(\">>> Starting %d myGEMM runs...\\n\", NUM_RUNS);\n    gettimeofday(&Tvalue, &dummy);\n    double starttime = 
(double)Tvalue.tv_sec + 1.0e-6*((double)Tvalue.tv_usec);\n    for (int r=0; r<NUM_RUNS; r++) {\n\n        // Run the myGEMM kernel\n        const size_t local[2] = { TS, TS };\n        const size_t global[2] = { (size_t)M, (size_t)N };\n        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &event);\n\n        // Wait for calculations to be finished\n        clWaitForEvents(1, &event);\n    }\n\n    // End the timed loop\n    gettimeofday(&Tvalue, &dummy);\n    double endtime = (double)Tvalue.tv_sec + 1.0e-6*((double)Tvalue.tv_usec);\n    double runtime = (endtime - starttime) / (double)NUM_RUNS;\n    double gflop = ((long)K * (long)M * (long)N * 2) / (1000.0*1000.0*1000.0);\n    printf(\">>> Done: took %.3lf seconds per run, %.1lf GFLOPS\\n\", runtime, gflop/runtime);\n\n    // Copy the output matrix C back to the CPU memory\n    clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, M*N*sizeof(float), C, 0, NULL, NULL);\n\n    // Free the OpenCL memory objects\n    clReleaseMemObject(bufA);\n    clReleaseMemObject(bufB);\n    clReleaseMemObject(bufC);\n\n    // Clean-up OpenCL\n    clReleaseCommandQueue(queue);\n    clReleaseContext(context);\n    clReleaseProgram(program);\n    clReleaseKernel(kernel);\n\n    // Free the host memory objects\n    free(A);\n    free(B);\n    free(C);\n\n    // Exit\n    return 0;\n}\n\n// =================================================================================================\n"
  },
  {
    "path": "scripts/stats.sh",
    "content": "#!/bin/bash\n\n# ==================================================================================================\n# Project: \n# Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n#\n# File information:\n# Institution.... SURFsara <www.surfsara.nl>\n# Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n# Changed at..... 2014-10-30\n# License........ MIT license\n# Tab-size....... 4 spaces\n# Line length.... 100 characters\n#\n# ==================================================================================================\n\n# Read the filename from the command-line\nfile=$1\n\n# Calculate occurences of particular instructions in the assembly\nFFMA=`cat $file | grep -c \"FFMA\"`\nLDS=`cat $file | grep -c \"LDS\"`\nSTS=`cat $file | grep -c \"STS\"`\nSHFL=`cat $file | grep -c \"SHFL\"`\nLD=`cat $file | grep -c \"LD[^S]\"`\nST=`cat $file | grep -c \"ST[^S]\"`\nMOV=`cat $file | grep -c \"MOV\"`\nSUM=$((FFMA+LDS+STS+SHFL+LD+ST+MOV+SUM))\n\n# Print the resulting statistics to screen\necho \">> Stats on $file:\"\necho \">> \"\necho \">> FFMA  $FFMA\"\necho \">> LDS   $LDS\"\necho \">> STS   $STS\"\necho \">> SHFL  $SHFL\"\necho \">> LD    $LD\"\necho \">> ST    $ST\"\necho \">> MOV   $MOV\"\necho \">> \"\necho \">> TOTAL=$SUM\"\n\n# ==================================================================================================\n"
  },
  {
    "path": "src/clGEMM.cpp",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-17\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n\n// Common include\n#include \"common.h\"\n\n// Include OpenCL \n#include <CL/cl.h>\n\n// Include kernel constants\n#include \"settings.h\"\n\n// Forward declaration of the OpenCL error checking function\nvoid checkError(cl_int error, int line);\n\n// =================================================================================================\n\n// Set the locations of the OpenCL kernel files\n#define CL_INCLUDE_FILE \"src/settings.h\"\n#define CL_KERNEL_FILE \"src/kernels.cl\"\n\n// Determine the location where to output the PTX code\n#define CL_PTX_FILE \"bin/myGEMM.cl.ptx\"\n\n// Define OpenCL compiler options, such as \"-cl-nv-maxrregcount=127\"\n#define COMPILER_OPTIONS \"\"\n\n// =================================================================================================\n\n// Matrix-multiplication using a custom OpenCL SGEMM kernel. 
This function also copies the input\n// matrices to the GPU, runs SGEMM, and copies the output matrix back to the CPU.\nvoid myclblas(float* A, float* B, float* C,\n              int K, int M, int N,\n              int timerID) {\n\n    // In case of myGEMM10, compute matrix sizes K, M, N as rounded-up to form complete tiles\n    #if KERNEL == 10\n        int K_XL = CEIL_DIV(K, TSK) * TSK;\n        int M_XL = CEIL_DIV(M, TSM) * TSM;\n        int N_XL = CEIL_DIV(N, TSN) * TSN;\n    #else\n        int K_XL = K;\n        int M_XL = M;\n        int N_XL = N;\n    #endif\n\n    // Define OpenCL variables\n    cl_int err;\n    cl_platform_id platform = 0;\n    cl_device_id device = 0;\n    cl_device_id devices[MAX_NUM_DEVICES];\n    cl_uint numDevices = 0;\n    cl_context_properties props[3] = {CL_CONTEXT_PLATFORM, 0, 0};\n    cl_context context = 0;\n    cl_command_queue queue = 0;\n    cl_event event = NULL;\n    cl_program program = NULL;\n    char deviceName[MAX_DEVICE_NAME];\n\n    // Configure the OpenCL environment\n    err = clGetPlatformIDs(1, &platform, NULL);\n    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);\n    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);\n    device = devices[CURRENT_DEVICE];\n    props[1] = (cl_context_properties)platform;\n    context = clCreateContext(props, 1, &device, NULL, NULL, &err);\n    queue = clCreateCommandQueue(context, device, 0, &err);\n    err = clGetDeviceInfo(device, CL_DEVICE_NAME, MAX_DEVICE_NAME, deviceName, NULL);\n    checkError(err,__LINE__);\n    //printf(\"## %d devices, running on %d: '%s'\\n\", numDevices, CURRENT_DEVICE, deviceName);\n\n    // Read the kernel file from disk\n    long sizeHeader, sizeSource;\n    char* header = readKernelFile(CL_INCLUDE_FILE, &sizeHeader);\n    char* source = readKernelFile(CL_KERNEL_FILE, &sizeSource);\n    long size = 2 + sizeHeader + sizeSource;\n    char* code = (char*)malloc(size*sizeof(char));\n    for (int 
c=0; c<size; c++) { code[c] = '\\0'; }\n    strcat(code, header);\n    strcat(code, source);\n    const char* constCode = code;\n    free(header);\n    free(source);\n\n    // Compile the kernel file\n    program = clCreateProgramWithSource(context, 1, &constCode, NULL, &err);\n    checkError(err,__LINE__);\n    err = clBuildProgram(program, 0, NULL, COMPILER_OPTIONS, NULL, NULL);\n\n    // Check for compilation errors\n    size_t logSize;\n    err = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, 0, NULL, &logSize);\n    checkError(err,__LINE__);\n    char* messages = (char*)malloc((1+logSize)*sizeof(char));\n    err = clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG, logSize, messages, NULL);\n    checkError(err,__LINE__);\n    messages[logSize] = '\\0';\n    //if (logSize > 10) { printf(\"## Compiler message: %s\\n\", messages); }\n    free(messages);\n\n    // Retrieve the PTX code from the OpenCL compiler and output it to disk\n    size_t binSize;\n    err = clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES, sizeof(size_t), &binSize, NULL);\n    checkError(err,__LINE__);\n    unsigned char *bin = (unsigned char *)malloc(binSize);\n    err = clGetProgramInfo(program, CL_PROGRAM_BINARIES, sizeof(unsigned char *), &bin, NULL);\n    checkError(err,__LINE__);\n    FILE* file = fopen(CL_PTX_FILE, \"wb\");\n    fwrite(bin, sizeof(char), binSize, file);\n    fclose(file);\n    free(bin);\n\n    // Prepare OpenCL memory objects\n    cl_mem bufA    = clCreateBuffer(context, CL_MEM_READ_ONLY,  M*K*sizeof(*A), NULL, &err);\n    cl_mem bufB    = clCreateBuffer(context, CL_MEM_READ_ONLY,  K*N*sizeof(*B), NULL, &err);\n    cl_mem bufB_TR = clCreateBuffer(context, CL_MEM_READ_ONLY,  N*K*sizeof(*B), NULL, &err);\n    cl_mem bufC    = clCreateBuffer(context, CL_MEM_READ_WRITE, M*N*sizeof(*C), NULL, &err);\n    checkError(err,__LINE__);\n\n    // Copy matrices to the GPU (also C to erase the results of the previous run)\n    err = 
clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, M*K*sizeof(*A), A, 0, NULL, NULL);\n    err = clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, K*N*sizeof(*B), B, 0, NULL, NULL);\n    err = clEnqueueWriteBuffer(queue, bufC, CL_TRUE, 0, M*N*sizeof(*C), C, 0, NULL, NULL);\n    checkError(err,__LINE__);\n\n    // Create extra objects for rounded-up sizes (only needed in case of myGEMM10)\n    cl_mem bufA_XL    = clCreateBuffer(context, CL_MEM_READ_ONLY,  M_XL*K_XL*sizeof(*A), NULL, &err);\n    cl_mem bufB_TR_XL = clCreateBuffer(context, CL_MEM_READ_ONLY,  N_XL*K_XL*sizeof(*B), NULL, &err);\n    cl_mem bufC_XL    = clCreateBuffer(context, CL_MEM_READ_WRITE, M_XL*N_XL*sizeof(*C), NULL, &err);\n    checkError(err,__LINE__);\n\n    // Configure the myGEMM kernel\n    char kernelname[100];\n    sprintf(kernelname, \"myGEMM%d\", KERNEL);\n    cl_kernel kernel1 = clCreateKernel(program, kernelname, &err);\n    checkError(err,__LINE__);\n\n    // Set the arguments of the myGEMM kernel\n    #if KERNEL == 10\n        err = clSetKernelArg(kernel1, 0, sizeof(int), (void*)&M_XL);\n        err = clSetKernelArg(kernel1, 1, sizeof(int), (void*)&N_XL);\n        err = clSetKernelArg(kernel1, 2, sizeof(int), (void*)&K_XL);\n        err = clSetKernelArg(kernel1, 3, sizeof(cl_mem), (void*)&bufA_XL);\n        err = clSetKernelArg(kernel1, 4, sizeof(cl_mem), (void*)&bufB_TR_XL);\n        err = clSetKernelArg(kernel1, 5, sizeof(cl_mem), (void*)&bufC_XL);\n    #else\n        err = clSetKernelArg(kernel1, 0, sizeof(int), (void*)&M);\n        err = clSetKernelArg(kernel1, 1, sizeof(int), (void*)&N);\n        err = clSetKernelArg(kernel1, 2, sizeof(int), (void*)&K);\n        err = clSetKernelArg(kernel1, 3, sizeof(cl_mem), (void*)&bufA);\n        #if KERNEL == 5 || KERNEL == 6 || KERNEL == 7 || KERNEL == 8 || KERNEL == 9\n            err = clSetKernelArg(kernel1, 4, sizeof(cl_mem), (void*)&bufB_TR);\n        #else\n            err = clSetKernelArg(kernel1, 4, sizeof(cl_mem), (void*)&bufB);\n        
#endif\n        err = clSetKernelArg(kernel1, 5, sizeof(cl_mem), (void*)&bufC);\n    #endif\n    checkError(err,__LINE__);\n\n    // Configure the supporting transpose kernel and set its arguments (only for certain myGEMMs)\n    #if KERNEL == 5 || KERNEL == 6 || KERNEL == 7 || KERNEL == 8 || KERNEL == 9 || KERNEL == 10\n        cl_kernel kernel2 = clCreateKernel(program, \"transpose\", &err);\n        checkError(err,__LINE__);\n        err = clSetKernelArg(kernel2, 0, sizeof(int), (void*)&K);\n        err = clSetKernelArg(kernel2, 1, sizeof(int), (void*)&N);\n        err = clSetKernelArg(kernel2, 2, sizeof(cl_mem), (void*)&bufB);\n        err = clSetKernelArg(kernel2, 3, sizeof(cl_mem), (void*)&bufB_TR);\n        checkError(err,__LINE__);\n        const size_t tLocal[2] = { TRANSPOSEX, TRANSPOSEY };\n        const size_t tGlobal[2] = { (size_t)K, (size_t)N };\n    #endif\n\n    // Configure the supporting padding kernels and set their arguments (only for myGEMM10)\n    #if KERNEL == 10\n        cl_kernel kernel3a = clCreateKernel(program, \"paddingAddZeroes\", &err);\n        checkError(err,__LINE__);\n        err = clSetKernelArg(kernel3a, 0, sizeof(int), (void*)&M);\n        err = clSetKernelArg(kernel3a, 1, sizeof(int), (void*)&K);\n        err = clSetKernelArg(kernel3a, 2, sizeof(cl_mem), (void*)&bufA);\n        err = clSetKernelArg(kernel3a, 3, sizeof(int), (void*)&M_XL);\n        err = clSetKernelArg(kernel3a, 4, sizeof(int), (void*)&K_XL);\n        err = clSetKernelArg(kernel3a, 5, sizeof(cl_mem), (void*)&bufA_XL);\n        checkError(err,__LINE__);\n        cl_kernel kernel3b = clCreateKernel(program, \"paddingAddZeroes\", &err);\n        checkError(err,__LINE__);\n        err = clSetKernelArg(kernel3b, 0, sizeof(int), (void*)&N);\n        err = clSetKernelArg(kernel3b, 1, sizeof(int), (void*)&K);\n        err = clSetKernelArg(kernel3b, 2, sizeof(cl_mem), (void*)&bufB_TR);\n        err = clSetKernelArg(kernel3b, 3, sizeof(int), (void*)&N_XL);\n        err = 
clSetKernelArg(kernel3b, 4, sizeof(int), (void*)&K_XL);\n        err = clSetKernelArg(kernel3b, 5, sizeof(cl_mem), (void*)&bufB_TR_XL);\n        checkError(err,__LINE__);\n        cl_kernel kernel3c = clCreateKernel(program, \"paddingRemoveZeroes\", &err);\n        checkError(err,__LINE__);\n        err = clSetKernelArg(kernel3c, 0, sizeof(int), (void*)&M_XL);\n        err = clSetKernelArg(kernel3c, 1, sizeof(int), (void*)&N_XL);\n        err = clSetKernelArg(kernel3c, 2, sizeof(cl_mem), (void*)&bufC_XL);\n        err = clSetKernelArg(kernel3c, 3, sizeof(int), (void*)&M);\n        err = clSetKernelArg(kernel3c, 4, sizeof(int), (void*)&N);\n        err = clSetKernelArg(kernel3c, 5, sizeof(cl_mem), (void*)&bufC);\n        checkError(err,__LINE__);\n        const size_t pLocal[2] = { PADDINGX, PADDINGY };\n        const size_t pAGlobal[2] = { (size_t)M_XL, (size_t)K_XL };\n        const size_t pBGlobal[2] = { (size_t)N_XL, (size_t)K_XL };\n        const size_t pCGlobal[2] = { (size_t)M, (size_t)N };\n    #endif\n\n    // Configure the thread/work-group dimensions of the myGEMM kernel\n    #if KERNEL == 1 || KERNEL == 2\n        const size_t local[2] = { TS, TS };\n        const size_t global[2] = { (size_t)M, (size_t)N };\n    #elif KERNEL == 3 || KERNEL == 5\n        const size_t local[2] = { TS, TS/WPT };\n        const size_t global[2] = { (size_t)M, (size_t)(N/WPT) };\n    #elif KERNEL == 4\n        const size_t local[2] = { TS/WIDTH, TS };\n        const size_t global[2] = { (size_t)(M/WIDTH), (size_t)N };\n    #elif KERNEL == 6 || KERNEL == 7 || KERNEL == 8 || KERNEL == 9\n        const size_t local[2] = { TSM/WPTM, TSN/WPTN };\n        const size_t global[2] = { (size_t)(M/WPTM), (size_t)(N/WPTN) };\n    #elif KERNEL == 10\n        const size_t local[2] = { TSM/WPTM, TSN/WPTN };\n        const size_t global[2] = { (size_t)(M_XL/WPTM), (size_t)(N_XL/WPTN) };\n    #elif KERNEL == 11\n        const size_t local[2] = { THREADSX, THREADSY };\n        const size_t 
global[2] = { (size_t)(M/RX), (size_t)(N/RY) };\n    #endif\n\n    // Start the timed loop\n    double startTime = timer();\n    for (int r=0; r<NUM_RUNS; r++) {\n\n        // Run the transpose kernel first\n        #if KERNEL == 5 || KERNEL == 6 || KERNEL == 7 || KERNEL == 8 || KERNEL == 9 || KERNEL == 10\n            err = clEnqueueNDRangeKernel(queue, kernel2, 2, NULL, tGlobal, tLocal, 0, NULL, &event);\n        #endif\n\n        // Make the inputs extra large with padded zeros\n        #if KERNEL == 10\n            err = clEnqueueNDRangeKernel(queue, kernel3a, 2, NULL, pAGlobal, pLocal, 0, NULL, &event);\n            err = clEnqueueNDRangeKernel(queue, kernel3b, 2, NULL, pBGlobal, pLocal, 0, NULL, &event);\n        #endif\n\n        // Run the myGEMM kernel\n        err = clEnqueueNDRangeKernel(queue, kernel1, 2, NULL, global, local, 0, NULL, &event);\n\n        // Remove padded zeroes from the larger output\n        #if KERNEL == 10\n            err = clEnqueueNDRangeKernel(queue, kernel3c, 2, NULL, pCGlobal, pLocal, 0, NULL, &event);\n        #endif\n\n        // Wait for calculations to be finished\n        checkError(err,__LINE__);\n        err = clWaitForEvents(1, &event);\n    }\n\n    // End the timed loop\n    timers[timerID].t += (timer() - startTime) / (double)NUM_RUNS;\n    timers[timerID].kf += ((long)K * (long)M * (long)N * 2) / 1000;\n\n    // Copy the output matrix C back to the CPU memory\n    err = clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, M*N*sizeof(*C), C, 0, NULL, NULL);\n    checkError(err,__LINE__);\n\n    // Free the memory objects\n    free(code);\n    clReleaseMemObject(bufA);\n    clReleaseMemObject(bufB);\n    clReleaseMemObject(bufB_TR);\n    clReleaseMemObject(bufC);\n    clReleaseMemObject(bufA_XL);\n    clReleaseMemObject(bufB_TR_XL);\n    clReleaseMemObject(bufC_XL);\n\n    // Clean-up OpenCL \n    clReleaseCommandQueue(queue);\n    clReleaseContext(context);\n    clReleaseProgram(program);\n    clReleaseKernel(kernel1);\n    
#if KERNEL == 5 || KERNEL == 6 || KERNEL == 7 || KERNEL == 8 || KERNEL == 9 || KERNEL == 10\n        clReleaseKernel(kernel2);\n    #endif\n    #if KERNEL == 10\n        clReleaseKernel(kernel3a);\n        clReleaseKernel(kernel3b);\n        clReleaseKernel(kernel3c);\n    #endif\n}\n\n// =================================================================================================\n\n// Print an error message to screen (only if it occurs)\nvoid checkError(cl_int error, int line) {\n    if (error != CL_SUCCESS) {\n        switch (error) {\n            case CL_DEVICE_NOT_FOUND:                 printf(\"-- Error at %d:  Device not found.\\n\", line); break;\n            case CL_DEVICE_NOT_AVAILABLE:             printf(\"-- Error at %d:  Device not available\\n\", line); break;\n            case CL_COMPILER_NOT_AVAILABLE:           printf(\"-- Error at %d:  Compiler not available\\n\", line); break;\n            case CL_MEM_OBJECT_ALLOCATION_FAILURE:    printf(\"-- Error at %d:  Memory object allocation failure\\n\", line); break;\n            case CL_OUT_OF_RESOURCES:                 printf(\"-- Error at %d:  Out of resources\\n\", line); break;\n            case CL_OUT_OF_HOST_MEMORY:               printf(\"-- Error at %d:  Out of host memory\\n\", line); break;\n            case CL_PROFILING_INFO_NOT_AVAILABLE:     printf(\"-- Error at %d:  Profiling information not available\\n\", line); break;\n            case CL_MEM_COPY_OVERLAP:                 printf(\"-- Error at %d:  Memory copy overlap\\n\", line); break;\n            case CL_IMAGE_FORMAT_MISMATCH:            printf(\"-- Error at %d:  Image format mismatch\\n\", line); break;\n            case CL_IMAGE_FORMAT_NOT_SUPPORTED:       printf(\"-- Error at %d:  Image format not supported\\n\", line); break;\n            case CL_BUILD_PROGRAM_FAILURE:            printf(\"-- Error at %d:  Program build failure\\n\", line); break;\n            case CL_MAP_FAILURE:                      printf(\"-- Error at %d:  
Map failure\\n\", line); break;\n            case CL_INVALID_VALUE:                    printf(\"-- Error at %d:  Invalid value\\n\", line); break;\n            case CL_INVALID_DEVICE_TYPE:              printf(\"-- Error at %d:  Invalid device type\\n\", line); break;\n            case CL_INVALID_PLATFORM:                 printf(\"-- Error at %d:  Invalid platform\\n\", line); break;\n            case CL_INVALID_DEVICE:                   printf(\"-- Error at %d:  Invalid device\\n\", line); break;\n            case CL_INVALID_CONTEXT:                  printf(\"-- Error at %d:  Invalid context\\n\", line); break;\n            case CL_INVALID_QUEUE_PROPERTIES:         printf(\"-- Error at %d:  Invalid queue properties\\n\", line); break;\n            case CL_INVALID_COMMAND_QUEUE:            printf(\"-- Error at %d:  Invalid command queue\\n\", line); break;\n            case CL_INVALID_HOST_PTR:                 printf(\"-- Error at %d:  Invalid host pointer\\n\", line); break;\n            case CL_INVALID_MEM_OBJECT:               printf(\"-- Error at %d:  Invalid memory object\\n\", line); break;\n            case CL_INVALID_IMAGE_FORMAT_DESCRIPTOR:  printf(\"-- Error at %d:  Invalid image format descriptor\\n\", line); break;\n            case CL_INVALID_IMAGE_SIZE:               printf(\"-- Error at %d:  Invalid image size\\n\", line); break;\n            case CL_INVALID_SAMPLER:                  printf(\"-- Error at %d:  Invalid sampler\\n\", line); break;\n            case CL_INVALID_BINARY:                   printf(\"-- Error at %d:  Invalid binary\\n\", line); break;\n            case CL_INVALID_BUILD_OPTIONS:            printf(\"-- Error at %d:  Invalid build options\\n\", line); break;\n            case CL_INVALID_PROGRAM:                  printf(\"-- Error at %d:  Invalid program\\n\", line); break;\n            case CL_INVALID_PROGRAM_EXECUTABLE:       printf(\"-- Error at %d:  Invalid program executable\\n\", line); break;\n            case 
CL_INVALID_KERNEL_NAME:              printf(\"-- Error at %d:  Invalid kernel name\\n\", line); break;\n            case CL_INVALID_KERNEL_DEFINITION:        printf(\"-- Error at %d:  Invalid kernel definition\\n\", line); break;\n            case CL_INVALID_KERNEL:                   printf(\"-- Error at %d:  Invalid kernel\\n\", line); break;\n            case CL_INVALID_ARG_INDEX:                printf(\"-- Error at %d:  Invalid argument index\\n\", line); break;\n            case CL_INVALID_ARG_VALUE:                printf(\"-- Error at %d:  Invalid argument value\\n\", line); break;\n            case CL_INVALID_ARG_SIZE:                 printf(\"-- Error at %d:  Invalid argument size\\n\", line); break;\n            case CL_INVALID_KERNEL_ARGS:              printf(\"-- Error at %d:  Invalid kernel arguments\\n\", line); break;\n            case CL_INVALID_WORK_DIMENSION:           printf(\"-- Error at %d:  Invalid work dimensions\\n\", line); break;\n            case CL_INVALID_WORK_GROUP_SIZE:          printf(\"-- Error at %d:  Invalid work group size\\n\", line); break;\n            case CL_INVALID_WORK_ITEM_SIZE:           printf(\"-- Error at %d:  Invalid work item size\\n\", line); break;\n            case CL_INVALID_GLOBAL_OFFSET:            printf(\"-- Error at %d:  Invalid global offset\\n\", line); break;\n            case CL_INVALID_EVENT_WAIT_LIST:          printf(\"-- Error at %d:  Invalid event wait list\\n\", line); break;\n            case CL_INVALID_EVENT:                    printf(\"-- Error at %d:  Invalid event\\n\", line); break;\n            case CL_INVALID_OPERATION:                printf(\"-- Error at %d:  Invalid operation\\n\", line); break;\n            case CL_INVALID_GL_OBJECT:                printf(\"-- Error at %d:  Invalid OpenGL object\\n\", line); break;\n            case CL_INVALID_BUFFER_SIZE:              printf(\"-- Error at %d:  Invalid buffer size\\n\", line); break;\n            case CL_INVALID_MIP_LEVEL:            
    printf(\"-- Error at %d:  Invalid mip-map level\\n\", line); break;\n            case -1024:                               printf(\"-- Error at %d:  *clBLAS* Functionality is not implemented\\n\", line); break;\n            case -1023:                               printf(\"-- Error at %d:  *clBLAS* Library is not initialized yet\\n\", line); break;\n            case -1022:                               printf(\"-- Error at %d:  *clBLAS* Matrix A is not a valid memory object\\n\", line); break;\n            case -1021:                               printf(\"-- Error at %d:  *clBLAS* Matrix B is not a valid memory object\\n\", line); break;\n            case -1020:                               printf(\"-- Error at %d:  *clBLAS* Matrix C is not a valid memory object\\n\", line); break;\n            case -1019:                               printf(\"-- Error at %d:  *clBLAS* Vector X is not a valid memory object\\n\", line); break;\n            case -1018:                               printf(\"-- Error at %d:  *clBLAS* Vector Y is not a valid memory object\\n\", line); break;\n            case -1017:                               printf(\"-- Error at %d:  *clBLAS* An input dimension (M,N,K) is invalid\\n\", line); break;\n            case -1016:                               printf(\"-- Error at %d:  *clBLAS* Leading dimension A must not be less than the size of the first dimension\\n\", line); break;\n            case -1015:                               printf(\"-- Error at %d:  *clBLAS* Leading dimension B must not be less than the size of the second dimension\\n\", line); break;\n            case -1014:                               printf(\"-- Error at %d:  *clBLAS* Leading dimension C must not be less than the size of the third dimension\\n\", line); break;\n            case -1013:                               printf(\"-- Error at %d:  *clBLAS* The increment for a vector X must not be 0\\n\", line); break;\n            case -1012:                          
     printf(\"-- Error at %d:  *clBLAS* The increment for a vector Y must not be 0\\n\", line); break;\n            case -1011:                               printf(\"-- Error at %d:  *clBLAS* The memory object for Matrix A is too small\\n\", line); break;\n            case -1010:                               printf(\"-- Error at %d:  *clBLAS* The memory object for Matrix B is too small\\n\", line); break;\n            case -1009:                               printf(\"-- Error at %d:  *clBLAS* The memory object for Matrix C is too small\\n\", line); break;\n            case -1008:                               printf(\"-- Error at %d:  *clBLAS* The memory object for Vector X is too small\\n\", line); break;\n            case -1007:                               printf(\"-- Error at %d:  *clBLAS* The memory object for Vector Y is too small\\n\", line); break;\n            case -1001:                               printf(\"-- Error at %d:  Code -1001: no GPU available?\\n\", line); break;\n            default:                                  printf(\"-- Error at %d:  Unknown error with code %d\\n\", line, error);\n        }\n        exit(1);\n    }\n}\n\n// =================================================================================================\n"
  },
  {
    "path": "src/cl_to_cuda.h",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-06\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n\n// Replace the OpenCL keywords with CUDA equivalents\n#define __kernel __placeholder__\n#define __global \n#define __placeholder__ __global__\n#define __local __shared__\n#define restrict __restrict__\n\n// Replace OpenCL synchronisation with CUDA synchronisation\n#define barrier(x) __syncthreads()\n\n// Replace the OpenCL get_xxx_id() functions with CUDA equivalents\n__device__ int get_local_id(int x) {\n    return (x == 0) ? threadIdx.x : threadIdx.y;\n}\n__device__ int get_group_id(int x) {\n    return (x == 0) ? blockIdx.x : blockIdx.y;\n}\n__device__ int get_global_id(int x) {\n    return (x == 0) ? blockIdx.x*blockDim.x + threadIdx.x : blockIdx.y*blockDim.y + threadIdx.y;\n}\n\n// Add the float8 data-type, which is not natively available under CUDA\ntypedef struct { float s0; float s1; float s2; float s3;\n                 float s4; float s5; float s6; float s7; } float8;\n\n// =================================================================================================\n"
  },
  {
    "path": "src/common.h",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-17\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n\n// Common C includes\n#include <stdlib.h>\n#include <stdio.h>\n#include <string.h>\n#include <math.h>\n#include <time.h>\n#include <sys/time.h>\n\n// =================================================================================================\n\n// Repeat all kernels multiple times to get an average timing result\n#define NUM_RUNS 4\n\n// Square matrices are tested within a certain range (e.g. 1024x1024, 2048x2048, 4096x4096)\n#define MINSIZE (1024)\n#define MAXSIZE (4*1024)\n\n// Set the alpha and beta values for the cuBLAS and clBlas libraries. 
Note that the myGEMM kernels\n// for simplicity only support alpha values of 1 and beta values of 0.\n#define ALPHA 1.0f\n#define BETA 0.0f\n\n// Define the current GPU's parameters\n#define GPU_NAME \"Tesla K40m\"\n#define GPU_CLOCK 0.745 // Core clock in GHz\n#define GPU_CORES 2880 // Total number of CUDA cores\n#define GPU_MOD 2 // FLOPs per fused multiply-add\n\n// OpenCL settings\n#define MAX_NUM_DEVICES 16\n#define MAX_DEVICE_NAME 1024\n#define CURRENT_DEVICE 0\n\n// =================================================================================================\n\n// Timer structure\ntypedef struct {\n    double t; // Time\n    long long int kf; // KFlops\n} profile_t;\n\n// Number of timers\n#define NUM_TIMERS 10\n\n// Global variable holding the timing results\nextern profile_t timers[NUM_TIMERS];\n\n// =================================================================================================\n\n// Forward declarations of BLAS functions\nvoid libcublas(float* A, float* B, float* C,\n               int K, int M, int N,\n               int timerID);\nvoid libclblas(float* A, float* B, float* C,\n               int K, int M, int N,\n               int timerID);\nvoid mycublas(float* A, float* B, float* C,\n              int K, int M, int N,\n              int timerID);\nvoid myclblas(float* A, float* B, float* C,\n              int K, int M, int N,\n              int timerID);\n\n// Forward declarations of the timer functions\ndouble timer(void);\ndouble wtime(profile_t timer);\ndouble gflops(profile_t timer);\n\n// Other forward declarations\nchar* readKernelFile(const char* filename, long* _size);\n\n// =================================================================================================\n"
  },
  {
    "path": "src/cuGEMM.cu",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-06\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n\n// Common include\n#include \"common.h\"\n\n// Include kernel constants\n#include \"settings.h\"\n\n// =================================================================================================\n\n// Configuration settings for the CUDA version (comment out if not desired)\n#define USE_LDG         // Whether to use the __ldg() intrinsic\n//#define USE_SHUFFLE   // Whether to use warp-shuffle instructions\n\n// Include the OpenCL-to-CUDA header and the OpenCL kernel-code\n#include \"cl_to_cuda.h\"\n#include \"kernels.cl\"\n\n// =================================================================================================\n\n// Matrix-multiplication using a custom CUDA SGEMM kernel. 
This function also copies the input\n// matrices to the GPU, runs SGEMM, and copies the output matrix back to the CPU.\nvoid mycublas(float* A, float* B, float* C,\n              int K, int M, int N,\n              int timerID) {\n\n    // In the case of myGEMM10, round the matrix sizes K, M, N up to form complete tiles\n    #if KERNEL == 10\n        int K_XL = CEIL_DIV(K, TSK) * TSK;\n        int M_XL = CEIL_DIV(M, TSM) * TSM;\n        int N_XL = CEIL_DIV(N, TSN) * TSN;\n    #else\n        int K_XL = K;\n        int M_XL = M;\n        int N_XL = N;\n    #endif\n\n    // Prepare CUDA memory objects\n    float* bufA = 0;\n    float* bufB = 0;\n    float* bufB_TR = 0; // This is the transposed version of B\n    float* bufC = 0;\n    cudaMalloc((void**)&bufA,    M*K*sizeof(*A));\n    cudaMalloc((void**)&bufB,    K*N*sizeof(*B));\n    cudaMalloc((void**)&bufB_TR, N*K*sizeof(*B));\n    cudaMalloc((void**)&bufC,    M*N*sizeof(*C));\n\n    // Copy matrices to the GPU (memset C to erase the results of the previous run)\n    cudaMemcpy((void*)bufA, (void*)A, M*K*sizeof(*A), cudaMemcpyHostToDevice);\n    cudaMemcpy((void*)bufB, (void*)B, K*N*sizeof(*B), cudaMemcpyHostToDevice);\n    cudaMemset((void*)bufC, 0, M*N*sizeof(*C));\n\n    // Create extra objects for rounded-up sizes (only needed in case of myGEMM10)\n    float* bufA_XL = 0;\n    float* bufB_TR_XL = 0;\n    float* bufC_XL = 0;\n    cudaMalloc((void**)&bufA_XL,    M_XL*K_XL*sizeof(*A));\n    cudaMalloc((void**)&bufB_TR_XL, K_XL*N_XL*sizeof(*B));\n    cudaMalloc((void**)&bufC_XL,    M_XL*N_XL*sizeof(*C));\n\n    // Configure the local memory (banks of 8 bytes, 48KB local memory)\n    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);\n    cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);\n\n    // Configure the thread/threadblock dimensions of the transpose kernel (only for certain myGEMMs)\n    #if KERNEL == 5 || KERNEL == 6 || KERNEL == 7 || KERNEL == 8 || KERNEL == 9 || KERNEL == 10\n        
dim3 blocksTRP(CEIL_DIV(K,TRANSPOSEX), CEIL_DIV(N,TRANSPOSEY));\n        dim3 threadsTRP(TRANSPOSEX, TRANSPOSEY);\n    #endif\n\n    // Configure the thread/threadblock dimensions of the padding kernels (only for myGEMM10)\n    #if KERNEL == 10\n        dim3 blocksA(CEIL_DIV(M_XL,PADDINGX), CEIL_DIV(K_XL,PADDINGY));\n        dim3 threadsA(PADDINGX, PADDINGY);\n        dim3 blocksB(CEIL_DIV(N_XL,PADDINGX), CEIL_DIV(K_XL,PADDINGY));\n        dim3 threadsB(PADDINGX, PADDINGY);\n        dim3 blocksC(CEIL_DIV(M,PADDINGX), CEIL_DIV(N,PADDINGY));\n        dim3 threadsC(PADDINGX, PADDINGY);\n    #endif\n\n    // Configure the thread/threadblock dimensions of the myGEMM kernel\n    #if KERNEL == 1 || KERNEL == 2\n        dim3 blocks(M/TS, N/TS);\n        dim3 threads(TS, TS);\n    #elif KERNEL == 3 || KERNEL == 5\n        dim3 blocks(M/TS, N/TS);\n        dim3 threads(TS, TS/WPT);\n    #elif KERNEL == 4\n        dim3 blocks(M/TS, N/TS);\n        dim3 threads(TS/WIDTH, TS);\n    #elif KERNEL == 6 || KERNEL == 7 || KERNEL == 8 || KERNEL == 9\n        dim3 blocks(M/TSM, N/TSN);\n        dim3 threads(TSM/WPTM, TSN/WPTN);\n    #elif KERNEL == 10\n        dim3 blocks(M_XL/TSM, N_XL/TSN);\n        dim3 threads(TSM/WPTM, TSN/WPTN);\n    #elif KERNEL == 11\n        dim3 blocks(M/(THREADSX*RX), N/(THREADSY*RY));\n        dim3 threads(THREADSX, THREADSY);\n    #endif\n\n    // Start the timed loop\n    double startTime = timer();\n    for (int r=0; r<NUM_RUNS; r++) {\n\n        // Run the transpose kernel first\n        #if KERNEL == 5 || KERNEL == 6 || KERNEL == 7 || KERNEL == 8 || KERNEL == 9 || KERNEL == 10\n            transpose<<<blocksTRP, threadsTRP>>>(K, N, bufB, bufB_TR);\n        #endif\n\n        // Make the inputs extra large with padded zeros\n        #if KERNEL == 10\n            paddingAddZeroes<<<blocksA, threadsA>>>(M, K, bufA, M_XL, K_XL, bufA_XL);\n            paddingAddZeroes<<<blocksB, threadsB>>>(N, K, bufB_TR, N_XL, K_XL, bufB_TR_XL);\n        #endif\n\n        
// Run the myGEMM kernel\n        #if KERNEL == 1\n            myGEMM1<<<blocks, threads>>>(M, N, K, bufA, bufB, bufC);\n        #elif KERNEL == 2\n            myGEMM2<<<blocks, threads>>>(M, N, K, bufA, bufB, bufC);\n        #elif KERNEL == 3\n            myGEMM3<<<blocks, threads>>>(M, N, K, bufA, bufB, bufC);\n        #elif KERNEL == 4\n            myGEMM4<<<blocks, threads>>>(M, N, K, (floatX*)bufA, (floatX*)bufB, (floatX*)bufC);\n        #elif KERNEL == 5\n            myGEMM5<<<blocks, threads>>>(M, N, K, bufA, bufB_TR, bufC);\n        #elif KERNEL == 6\n            myGEMM6<<<blocks, threads>>>(M, N, K, bufA, bufB_TR, bufC);\n        #elif KERNEL == 7\n            myGEMM7<<<blocks, threads>>>(M, N, K, (floatX*)bufA, (floatX*)bufB_TR, bufC);\n        #elif KERNEL == 8\n            myGEMM8<<<blocks, threads>>>(M, N, K, (floatX*)bufA, (floatX*)bufB_TR, bufC);\n        #elif KERNEL == 9\n            myGEMM9<<<blocks, threads>>>(M, N, K, (floatX*)bufA, (floatX*)bufB_TR, bufC);\n        #elif KERNEL == 10\n            myGEMM10<<<blocks, threads>>>(M_XL, N_XL, K_XL, (floatX*)bufA_XL, (floatX*)bufB_TR_XL, bufC_XL);\n        #elif KERNEL == 11\n            myGEMM11<<<blocks, threads>>>(M, N, K, (floatA*)bufA, (floatB*)bufB, (floatC*)bufC);\n        #endif\n\n        // Remove padded zeroes from the larger output\n        #if KERNEL == 10\n            paddingRemoveZeroes<<<blocksC, threadsC>>>(M_XL, N_XL, bufC_XL, M, N, bufC);\n        #endif\n\n        // Wait for calculations to be finished\n        cudaDeviceSynchronize();\n    }\n\n    // End the timed loop\n    timers[timerID].t += (timer() - startTime) / (double)NUM_RUNS;\n    timers[timerID].kf += ((long)K * (long)M * (long)N * 2) / 1000;\n\n    // Copy the output matrix C back to the CPU memory\n    cudaMemcpy((void*)C, (void*)bufC, M*N*sizeof(*C), cudaMemcpyDeviceToHost);\n\n    // Free the GPU memory objects\n    cudaFree(bufA);\n    cudaFree(bufB);\n    cudaFree(bufB_TR);\n    cudaFree(bufC);\n    
cudaFree(bufA_XL);\n    cudaFree(bufB_TR_XL);\n    cudaFree(bufC_XL);\n}\n\n// =================================================================================================\n"
  },
  {
    "path": "src/kernels.cl",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-06\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n//\n// Matrices in column-major format\n// A: K columns, M rows\n// B: N columns, K rows\n// C: N columns, M rows\n//                         \n//                   N     \n//                o-----o  \n//                |     |  \n//              K | [B] |  \n//                |     |  \n//                o-----o  \n//        K          N     \n//    o-------o   o-----o  \n//  M |  [A]  | M | [C] |  \n//    |       |   |     |  \n//    o-------o   o-----o  \n//                         \n//\n// C-code for column-major matrix multiplication with alpha=1 and beta=0:\n//\n// for (int m=0; m<M; m++) {\n//     for (int n=0; n<N; n++) {\n//         float acc = 0.0f;\n//         for (int k=0; k<K; k++) {\n//             acc += A[k*M + m] * B[n*K + k];\n//         }\n//         C[n*M + m] = acc;\n//     }\n// }\n//\n// =================================================================================================\n\n// Data-widths\n#if WIDTH == 1\n    typedef float floatX;\n#elif WIDTH == 2\n    typedef float2 floatX;\n#elif WIDTH == 4\n    typedef float4 floatX;\n#elif WIDTH == 8\n    typedef float8 floatX;\n#endif\n\n// =================================================================================================\n#if KERNEL == 1\n\n// First naive implementation\n__kernel void myGEMM1(const int M, const int N, const int K,\n                      const __global float* A,\n          
            const __global float* B,\n                      __global float* C) {\n    \n    // Thread identifiers\n    const int globalRow = get_global_id(0); // Row ID of C (0..M)\n    const int globalCol = get_global_id(1); // Col ID of C (0..N)\n\n    // Compute a single element (loop over K)\n    float acc = 0.0f;\n    for (int k=0; k<K; k++) {\n        acc += A[k*M + globalRow] * B[globalCol*K + k];\n    }\n\n    // Store the result\n    C[globalCol*M + globalRow] = acc;\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 2\n\n// Tiled and coalesced version\n__kernel void myGEMM2(const int M, const int N, const int K,\n                      const __global float* A,\n                      const __global float* B,\n                      __global float* C) {\n    \n    // Thread identifiers\n    const int row = get_local_id(0); // Local row ID (max: TS)\n    const int col = get_local_id(1); // Local col ID (max: TS)\n    const int globalRow = TS*get_group_id(0) + row; // Row ID of C (0..M)\n    const int globalCol = TS*get_group_id(1) + col; // Col ID of C (0..N)\n\n    // Local memory to fit a tile of TS*TS elements of A and B\n    __local float Asub[TS][TS];\n    __local float Bsub[TS][TS];\n\n    // Initialise the accumulation register\n    float acc = 0.0f;\n    \n    // Loop over all tiles\n    const int numTiles = K/TS;\n    for (int t=0; t<numTiles; t++) {\n\n        // Load one tile of A and B into local memory\n        const int tiledRow = TS*t + row;\n        const int tiledCol = TS*t + col;\n        Asub[col][row] = A[tiledCol*M + globalRow];\n        Bsub[col][row] = B[globalCol*K + tiledRow];\n\n        // Synchronise to make sure the tile is loaded\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Perform the computation for a single tile\n        for (int k=0; k<TS; k++) {\n            acc += Asub[k][row] * Bsub[col][k];\n        }\n\n        // Synchronise before 
loading the next tile\n        barrier(CLK_LOCAL_MEM_FENCE);\n    }\n\n    // Store the final result in C\n    C[globalCol*M + globalRow] = acc;\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 3\n\n// Increased the amount of work-per-thread by a factor WPT\n__kernel void myGEMM3(const int M, const int N, const int K,\n                      const __global float* A,\n                      const __global float* B,\n                      __global float* C) {\n    \n    // Thread identifiers\n    const int row = get_local_id(0); // Local row ID (max: TS)\n    const int col = get_local_id(1); // Local col ID (max: TS/WPT == RTS)\n    const int globalRow = TS*get_group_id(0) + row; // Row ID of C (0..M)\n    const int globalCol = TS*get_group_id(1) + col; // Col ID of C (0..N)\n\n    // Local memory to fit a tile of TS*TS elements of A and B\n    __local float Asub[TS][TS];\n    __local float Bsub[TS][TS];\n\n    // Initialise the accumulation registers\n    float acc[WPT];\n    for (int w=0; w<WPT; w++) {\n        acc[w] = 0.0f;\n    }\n    \n    // Loop over all tiles\n    const int numTiles = K/TS;\n    for (int t=0; t<numTiles; t++) {\n\n        // Load one tile of A and B into local memory\n        for (int w=0; w<WPT; w++) {\n            const int tiledRow = TS*t + row;\n            const int tiledCol = TS*t + col;\n            Asub[col + w*RTS][row] = A[(tiledCol + w*RTS)*M + globalRow];\n            Bsub[col + w*RTS][row] = B[(globalCol + w*RTS)*K + tiledRow];\n        }\n\n        // Synchronise to make sure the tile is loaded\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Perform the computation for a single tile\n        for (int k=0; k<TS; k++) {\n            for (int w=0; w<WPT; w++) {\n                acc[w] += Asub[k][row] * Bsub[col + w*RTS][k];\n            }\n        }\n\n        // Synchronise before loading the next tile\n        barrier(CLK_LOCAL_MEM_FENCE);\n    
}\n\n    // Store the final results in C\n    for (int w=0; w<WPT; w++) {\n        C[(globalCol + w*RTS)*M + globalRow] = acc[w];\n    }\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 4\n\n// Use wider data types\n__kernel void myGEMM4(const int M, const int N, const int K,\n                      const __global floatX* A,\n                      const __global floatX* B,\n                      __global floatX* C) {\n\n    // Thread identifiers\n    const int row = get_local_id(0); // Local row ID (max: TS/WIDTH)\n    const int col = get_local_id(1); // Local col ID (max: TS)\n    const int globalRow = (TS/WIDTH)*get_group_id(0) + row; // Row ID of C (0..M/WIDTH)\n    const int globalCol = TS*get_group_id(1) + col; // Col ID of C (0..N)\n\n    // Local memory to fit a tile of TS*TS elements of A and B\n    __local floatX Asub[TS][TS/WIDTH];\n    __local floatX Bsub[TS][TS/WIDTH];\n\n    // Initialise the accumulation registers\n    #if WIDTH == 1\n        floatX acc = 0.0f;\n    #elif WIDTH == 2\n        floatX acc = { 0.0f, 0.0f };\n    #elif WIDTH == 4\n        floatX acc = { 0.0f, 0.0f, 0.0f, 0.0f };\n    #elif WIDTH == 8\n        floatX acc = { 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f, 0.0f };\n    #endif\n    \n    // Loop over all tiles\n    const int numTiles = K/TS;\n    for (int tile=0; tile<numTiles; tile++) {\n\n        // Load one tile of A and B into local memory\n        const int tiledRow = (TS/WIDTH)*tile + row;\n        const int tiledCol = TS*tile + col;\n        Asub[col][row] = A[tiledCol*(M/WIDTH) + globalRow];\n        Bsub[col][row] = B[globalCol*(K/WIDTH) + tiledRow];\n\n        // Synchronise to make sure the tile is loaded\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Perform the computation for a single tile\n        floatX vecA, vecB;\n        float valB;\n        for (int k=0; k<TS/WIDTH; k++) {\n            vecB = Bsub[col][k];\n            for (int 
w=0; w<WIDTH; w++) {\n                vecA = Asub[WIDTH*k + w][row];\n                #if WIDTH == 1\n                    valB = vecB;\n                    acc += vecA * valB;\n                #elif WIDTH == 2\n                    switch (w) {\n                        case 0: valB = vecB.x; break;\n                        case 1: valB = vecB.y; break;\n                    }\n                    acc.x += vecA.x * valB;\n                    acc.y += vecA.y * valB;\n                #elif WIDTH == 4\n                    switch (w) {\n                        case 0: valB = vecB.x; break;\n                        case 1: valB = vecB.y; break;\n                        case 2: valB = vecB.z; break;\n                        case 3: valB = vecB.w; break;\n                    }\n                    acc.x += vecA.x * valB;\n                    acc.y += vecA.y * valB;\n                    acc.z += vecA.z * valB;\n                    acc.w += vecA.w * valB;\n                #elif WIDTH == 8\n                    switch (w) {\n                        case 0: valB = vecB.s0; break;\n                        case 1: valB = vecB.s1; break;\n                        case 2: valB = vecB.s2; break;\n                        case 3: valB = vecB.s3; break;\n                        case 4: valB = vecB.s4; break;\n                        case 5: valB = vecB.s5; break;\n                        case 6: valB = vecB.s6; break;\n                        case 7: valB = vecB.s7; break;\n                    }\n                    acc.s0 += vecA.s0 * valB;\n                    acc.s1 += vecA.s1 * valB;\n                    acc.s2 += vecA.s2 * valB;\n                    acc.s3 += vecA.s3 * valB;\n                    acc.s4 += vecA.s4 * valB;\n                    acc.s5 += vecA.s5 * valB;\n                    acc.s6 += vecA.s6 * valB;\n                    acc.s7 += vecA.s7 * valB;\n                #endif\n            }\n        }\n\n        // Synchronise before loading the next tile\n        
barrier(CLK_LOCAL_MEM_FENCE);\n    }\n\n    // Store the final results in C\n    C[globalCol*(M/WIDTH) + globalRow] = acc;\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 5\n\n// Pre-transpose the input matrix B and use rectangular tiles\n__kernel void myGEMM5(const int M, const int N, const int K,\n                      const __global float* A,\n                      const __global float* B,\n                      __global float* C) {\n\n    // Thread identifiers\n    const int row = get_local_id(0); // Local row ID (max: TS)\n    const int col = get_local_id(1); // Local col ID (max: TS/WPT == RTS)\n    const int globalRow = TS*get_group_id(0) + row; // Row ID of C (0..M)\n    const int globalCol = TS*get_group_id(1) + col; // Col ID of C (0..N)\n\n    // Local memory to fit a tile of A and B\n    __local float Asub[TSDK][TS];\n    __local float Bsub[TS][TSDK+2];\n\n    // Initialise the accumulation registers\n    float acc[WPT];\n    for (int w=0; w<WPT; w++) {\n        acc[w] = 0.0f;\n    }\n    \n    // Loop over all tiles\n    const int numTiles = K/TSDK;\n    for (int t=0; t<numTiles; t++) {\n\n        // Load one tile of A and B into local memory\n        for (int l=0; l<LPT; l++) {\n            const int tiledIndex = TSDK*t + col + l*RTS;\n            int indexA = (tiledIndex)*M + TS*get_group_id(0) + row;\n            int indexB = (tiledIndex)*N + TS*get_group_id(1) + row;\n            Asub[col + l*RTS][row] = A[indexA];\n            Bsub[row][col + l*RTS] = B[indexB];\n        }\n\n        // Synchronise to make sure the tile is loaded\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Perform the computation for a single tile\n        for (int k=0; k<TSDK; k++) {\n            for (int w=0; w<WPT; w++) {\n                acc[w] += Asub[k][row] * Bsub[col + w*RTS][k];\n            }\n        }\n\n        // Synchronise before loading the next tile\n        
barrier(CLK_LOCAL_MEM_FENCE);\n    }\n\n    // Store the final results in C\n    for (int w=0; w<WPT; w++) {\n        C[(globalCol + w*RTS)*M + globalRow] = acc[w];\n    }\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 6\n\n// Use 2D register blocking (further increase in work per thread)\n__kernel void myGEMM6(const int M, const int N, const int K,\n                      const __global float* A,\n                      const __global float* B,\n                      __global float* C) {\n\n    // Thread identifiers\n    const int tidm = get_local_id(0); // Local row ID (max: TSM/WPTM == RTSM)\n    const int tidn = get_local_id(1); // Local col ID (max: TSN/WPTN == RTSN)\n    const int offsetM = TSM*get_group_id(0); // Work-group offset\n    const int offsetN = TSN*get_group_id(1); // Work-group offset\n\n    // Local memory to fit a tile of A and B\n    __local float Asub[TSK][TSM];\n    __local float Bsub[TSN][TSK+2];\n\n    // Allocate register space\n    float Areg;\n    float Breg[WPTN];\n    float acc[WPTM][WPTN];\n\n    // Initialise the accumulation registers\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            acc[wm][wn] = 0.0f;\n        }\n    }\n    \n    // Loop over all tiles\n    const int numTiles = K/TSK;\n    int t=0;\n    do {\n\n        // Load one tile of A and B into local memory\n        #pragma unroll\n        for (int la=0; la<LPTA; la++) {\n            int tid = tidn*RTSM + tidm;\n            int id = la*RTSN*RTSM + tid;\n            int row = MOD2(id,TSM);\n            int col = DIV2(id,TSM);\n            int tiledIndex = TSK*t + col;\n            Asub[col][row] = A[tiledIndex*M + offsetM + row];\n            Bsub[row][col] = B[tiledIndex*N + offsetN + row];\n        }\n\n        // Synchronise to make sure the tile is loaded\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // 
Loop over the values of a single tile\n        for (int k=0; k<TSK; k++) {\n\n            // Cache the values of Bsub in registers\n            #pragma unroll\n            for (int wn=0; wn<WPTN; wn++) {\n                int col = tidn + wn*RTSN;\n                Breg[wn] = Bsub[col][k];\n            }\n\n            // Perform the computation\n            #pragma unroll\n            for (int wm=0; wm<WPTM; wm++) {\n                int row = tidm + wm*RTSM;\n                Areg = Asub[k][row];\n                #pragma unroll\n                for (int wn=0; wn<WPTN; wn++) {\n                    acc[wm][wn] += Areg * Breg[wn];\n                }\n            }\n        }\n\n        // Synchronise before loading the next tile\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Next tile\n        t++;\n    } while (t<numTiles);\n\n    // Store the final results in C\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        int globalRow = offsetM + tidm + wm*RTSM;\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            int globalCol = offsetN + tidn + wn*RTSN;\n            C[globalCol*M + globalRow] = acc[wm][wn];\n        }\n    }\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 7\n\n// Wider loads combined with 2D register blocking\n__kernel void myGEMM7(const int M, const int N, const int K,\n                      const __global floatX* A,\n                      const __global floatX* B,\n                      __global float* C) {\n\n    // Thread identifiers\n    const int tidm = get_local_id(0); // Local row ID (max: TSM/WPTM == RTSM)\n    const int tidn = get_local_id(1); // Local col ID (max: TSN/WPTN == RTSN)\n    const int offsetM = TSM*get_group_id(0); // Work-group offset\n    const int offsetN = TSN*get_group_id(1); // Work-group offset\n\n    // Local memory to fit a tile of A and B\n    __local float Asub[TSK][TSM];\n    __local float 
Bsub[TSK][TSN];\n\n    // Allocate register space\n    float Areg;\n    float Breg[WPTN];\n    float acc[WPTM][WPTN];\n\n    // Initialise the accumulation registers\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            acc[wm][wn] = 0.0f;\n        }\n    }\n    \n    // Loop over all tiles\n    const int numTiles = K/TSK;\n    int t=0;\n    do {\n\n        // Load one tile of A and B into local memory\n        #pragma unroll\n        for (int la=0; la<LPTA/WIDTH; la++) {\n            int tid = tidn*RTSM + tidm;\n            int id = la*RTSN*RTSM + tid;\n            int row = MOD2(id,TSM/WIDTH);\n            int col = DIV2(id,TSM/WIDTH);\n\n            // Load the values (wide vector load)\n            int tiledIndex = TSK*t + col;\n            floatX vecA = A[tiledIndex*(M/WIDTH) + offsetM/WIDTH + row];\n            floatX vecB = B[tiledIndex*(N/WIDTH) + offsetN/WIDTH + row];\n\n            // Store the loaded vectors into local memory\n            #if WIDTH == 1\n                Asub[col][row] = vecA;\n            #elif WIDTH == 2\n                Asub[col][WIDTH*row + 0] = vecA.x;\n                Asub[col][WIDTH*row + 1] = vecA.y;\n            #elif WIDTH == 4\n                Asub[col][WIDTH*row + 0] = vecA.x;\n                Asub[col][WIDTH*row + 1] = vecA.y;\n                Asub[col][WIDTH*row + 2] = vecA.z;\n                Asub[col][WIDTH*row + 3] = vecA.w;\n            #endif\n            #if WIDTH == 1\n                Bsub[col][row] = vecB;\n            #elif WIDTH == 2\n                Bsub[col][WIDTH*row + 0] = vecB.x;\n                Bsub[col][WIDTH*row + 1] = vecB.y;\n            #elif WIDTH == 4\n                Bsub[col][WIDTH*row + 0] = vecB.x;\n                Bsub[col][WIDTH*row + 1] = vecB.y;\n                Bsub[col][WIDTH*row + 2] = vecB.z;\n                Bsub[col][WIDTH*row + 3] = vecB.w;\n            #endif\n        }\n\n        // Synchronise to make 
sure the tile is loaded\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Loop over the values of a single tile\n        #pragma unroll\n        for (int k=0; k<TSK; k++) {\n\n            // Cache the values of Bsub in registers\n            #pragma unroll\n            for (int wn=0; wn<WPTN; wn++) {\n                int col = tidn + wn*RTSN;\n                Breg[wn] = Bsub[k][col];\n            }\n\n            // Perform the computation\n            #pragma unroll\n            for (int wm=0; wm<WPTM; wm++) {\n                int row = tidm + wm*RTSM;\n                Areg = Asub[k][row];\n                #pragma unroll\n                for (int wn=0; wn<WPTN; wn++) {\n                    acc[wm][wn] += Areg * Breg[wn];\n                }\n            }\n        }\n\n        // Synchronise before loading the next tile\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Next tile\n        t++;\n    } while (t<numTiles);\n\n    // Store the final results in C\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        int globalRow = offsetM + tidm + wm*RTSM;\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            int globalCol = offsetN + tidn + wn*RTSN;\n            C[globalCol*M + globalRow] = acc[wm][wn];\n        }\n    }\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 8\n\n// CUDA and Kepler-specific optimisations (LDG and warp-shuffle)\n__kernel void myGEMM8(const int M, const int N, const int K,\n                      const __global floatX* A,\n                      const __global floatX* B,\n                      __global float* C) {\n\n    // Thread identifiers\n    const int tidm = get_local_id(0); // Local row ID (max: TSM/WPTM == RTSM)\n    const int tidn = get_local_id(1); // Local col ID (max: TSN/WPTN == RTSN)\n    const int offsetM = TSM*get_group_id(0); // Work-group offset\n    const int offsetN = TSN*get_group_id(1); // Work-group 
offset\n\n    // Local memory to fit a tile of A and B\n    __local float Asub[TSK][TSM];\n    __local float Bsub[TSK][TSN];\n\n    // Allocate register space\n    float Areg;\n    float Breg[WPTN];\n    float acc[WPTM][WPTN];\n\n    // Initialise the accumulation registers\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            acc[wm][wn] = 0.0f;\n        }\n    }\n    \n    // Loop over all tiles\n    const int numTiles = K/TSK;\n    int t=0;\n    do {\n\n        // Load one tile of A and B into local memory\n        #pragma unroll\n        for (int la=0; la<LPTA/WIDTH; la++) {\n            int tid = tidn*RTSM + tidm;\n            int id = la*RTSN*RTSM + tid;\n            int row = MOD2(id,TSM/WIDTH);\n            int col = DIV2(id,TSM/WIDTH);\n\n            // Load the values (wide vector load)\n            int tiledIndex = TSK*t + col;\n            int indexA = tiledIndex*(M/WIDTH) + offsetM/WIDTH + row;\n            int indexB = tiledIndex*(N/WIDTH) + offsetN/WIDTH + row;\n            #ifdef USE_LDG\n                floatX vecA = __ldg(&A[indexA]);\n                floatX vecB = __ldg(&B[indexB]);\n            #else\n                floatX vecA = A[indexA];\n                floatX vecB = B[indexB];\n            #endif\n\n            // Store the loaded vectors into local memory\n            #if WIDTH == 1\n                Asub[col][row] = vecA;\n            #elif WIDTH == 2\n                Asub[col][WIDTH*row + 0] = vecA.x;\n                Asub[col][WIDTH*row + 1] = vecA.y;\n            #elif WIDTH == 4\n                Asub[col][WIDTH*row + 0] = vecA.x;\n                Asub[col][WIDTH*row + 1] = vecA.y;\n                Asub[col][WIDTH*row + 2] = vecA.z;\n                Asub[col][WIDTH*row + 3] = vecA.w;\n            #endif\n            #if WIDTH == 1\n                Bsub[col][row] = vecB;\n            #elif WIDTH == 2\n                Bsub[col][WIDTH*row + 0] = vecB.x;\n     
           Bsub[col][WIDTH*row + 1] = vecB.y;\n            #elif WIDTH == 4\n                Bsub[col][WIDTH*row + 0] = vecB.x;\n                Bsub[col][WIDTH*row + 1] = vecB.y;\n                Bsub[col][WIDTH*row + 2] = vecB.z;\n                Bsub[col][WIDTH*row + 3] = vecB.w;\n            #endif\n        }\n\n        // Synchronise to make sure the tile is loaded\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Loop over the values of a single tile\n        #pragma unroll\n        for (int k=0; k<TSK; k++) {\n\n            // Cache the values of Bsub in registers\n            #ifdef USE_SHUFFLE\n                int col = tidn + (tidm % WPTN)*RTSN;\n                float val = Bsub[k][col];\n                #pragma unroll\n                for (int wn=0; wn<WPTN; wn++) {\n                    Breg[wn] = __shfl(val, wn, WPTN);\n                }\n            #else\n                #pragma unroll\n                for (int wn=0; wn<WPTN; wn++) {\n                    int col = tidn + wn*RTSN;\n                    Breg[wn] = Bsub[k][col];\n                }\n            #endif\n\n            /*// Cache the values of Asub in registers\n            #ifdef USE_SHUFFLE\n                for (int wn=0; wn<WPTN; wn+=(32/RTSM)) {\n                    int type = tidn % (32/RTSM);\n                    int row = tidm + (wn+type)*RTSM;\n                    float val = Asub[k][row];\n                    Areg[wn] = __shfl_up(val, RTSM);\n                    Areg[wn+1] = __shfl_down(val, RTSM);\n                }\n            #endif */\n\n            // Perform the computation\n            #pragma unroll\n            for (int wm=0; wm<WPTM; wm++) {\n                int row = tidm + wm*RTSM;\n                Areg = Asub[k][row];\n                #pragma unroll\n                for (int wn=0; wn<WPTN; wn++) {\n                    acc[wm][wn] += Areg * Breg[wn];\n                }\n            }\n        }\n\n        // Synchronise before loading the next tile\n        
barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Next tile\n        t++;\n    } while (t<numTiles);\n\n    // Store the final results in C\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        int globalRow = offsetM + tidm + wm*RTSM;\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            int globalCol = offsetN + tidn + wn*RTSN;\n            C[globalCol*M + globalRow] = acc[wm][wn];\n        }\n    }\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 9\n\n// With pre-fetching\n__kernel void myGEMM9(const int M, const int N, const int K,\n                      const __global floatX* A,\n                      const __global floatX* B,\n                      __global float* C) {\n\n    // Thread identifiers\n    const int tidm = get_local_id(0); // Local row ID (max: TSM/WPTM == RTSM)\n    const int tidn = get_local_id(1); // Local col ID (max: TSN/WPTN == RTSN)\n    const int offsetM = TSM*get_group_id(0); // Work-group offset\n    const int offsetN = TSN*get_group_id(1); // Work-group offset\n\n    // Local memory to fit two tiles of A and B\n    __local float Asub[2][TSK*TSM];\n    __local float Bsub[2][TSK*TSN];\n\n    // Allocate register space\n    float Areg;\n    float Breg[WPTN];\n    float acc[WPTM][WPTN];\n\n    // Initialise the accumulation registers\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            acc[wm][wn] = 0.0f;\n        }\n    }\n\n    // Load the first tile of A and B into local memory\n    #pragma unroll\n    for (int la=0; la<LPTA/WIDTH; la++) {\n        int tid = tidn*RTSM + tidm;\n        int id = la*RTSN*RTSM + tid;\n        int row = MOD2(id,TSM/WIDTH);\n        int col = DIV2(id,TSM/WIDTH);\n\n        // Load the values (wide vector load)\n        int tiledIndex = TSK*0 + col;\n        int indexA = tiledIndex*(M/WIDTH) + offsetM/WIDTH + row;\n      
  int indexB = tiledIndex*(N/WIDTH) + offsetN/WIDTH + row;\n        #ifdef USE_LDG\n            floatX vecA = __ldg(&A[indexA]);\n            floatX vecB = __ldg(&B[indexB]);\n        #else\n            floatX vecA = A[indexA];\n            floatX vecB = B[indexB];\n        #endif\n\n        // Store the loaded vectors into local memory\n        #if WIDTH == 1\n            Asub[0][col*TSM + row] = vecA;\n        #elif WIDTH == 2\n            Asub[0][col*TSM + WIDTH*row + 0] = vecA.x;\n            Asub[0][col*TSM + WIDTH*row + 1] = vecA.y;\n        #elif WIDTH == 4\n            Asub[0][col*TSM + WIDTH*row + 0] = vecA.x;\n            Asub[0][col*TSM + WIDTH*row + 1] = vecA.y;\n            Asub[0][col*TSM + WIDTH*row + 2] = vecA.z;\n            Asub[0][col*TSM + WIDTH*row + 3] = vecA.w;\n        #endif\n        #if WIDTH == 1\n            Bsub[0][col*TSN + row] = vecB;\n        #elif WIDTH == 2\n            Bsub[0][col*TSN + WIDTH*row + 0] = vecB.x;\n            Bsub[0][col*TSN + WIDTH*row + 1] = vecB.y;\n        #elif WIDTH == 4\n            Bsub[0][col*TSN + WIDTH*row + 0] = vecB.x;\n            Bsub[0][col*TSN + WIDTH*row + 1] = vecB.y;\n            Bsub[0][col*TSN + WIDTH*row + 2] = vecB.z;\n            Bsub[0][col*TSN + WIDTH*row + 3] = vecB.w;\n        #endif\n    }\n\n    // Synchronise\n    barrier(CLK_LOCAL_MEM_FENCE);\n    \n    // Loop over all tiles\n    const int numTiles = K/TSK;\n    int t=0;\n    do {\n\n        // Load the next tile of A and B into local memory\n        int tt = t + 1;\n        if (tt < numTiles) {\n            #pragma unroll\n            for (int la=0; la<LPTA/WIDTH; la++) {\n                int tid = tidn*RTSM + tidm;\n                int id = la*RTSN*RTSM + tid;\n                int row = MOD2(id,TSM/WIDTH);\n                int col = DIV2(id,TSM/WIDTH);\n\n                // Load the values (wide vector load)\n                int tiledIndex = TSK*tt + col;\n                int indexA = tiledIndex*(M/WIDTH) + offsetM/WIDTH + row;\n 
               int indexB = tiledIndex*(N/WIDTH) + offsetN/WIDTH + row;\n                #ifdef USE_LDG\n                    floatX vecA = __ldg(&A[indexA]);\n                    floatX vecB = __ldg(&B[indexB]);\n                #else\n                    floatX vecA = A[indexA];\n                    floatX vecB = B[indexB];\n                #endif\n\n                // Store the loaded vectors into local memory\n                #if WIDTH == 1\n                    Asub[tt%2][col*TSM + row] = vecA;\n                #elif WIDTH == 2\n                    Asub[tt%2][col*TSM + WIDTH*row + 0] = vecA.x;\n                    Asub[tt%2][col*TSM + WIDTH*row + 1] = vecA.y;\n                #elif WIDTH == 4\n                    Asub[tt%2][col*TSM + WIDTH*row + 0] = vecA.x;\n                    Asub[tt%2][col*TSM + WIDTH*row + 1] = vecA.y;\n                    Asub[tt%2][col*TSM + WIDTH*row + 2] = vecA.z;\n                    Asub[tt%2][col*TSM + WIDTH*row + 3] = vecA.w;\n                #endif\n                #if WIDTH == 1\n                    Bsub[tt%2][col*TSN + row] = vecB;\n                #elif WIDTH == 2\n                    Bsub[tt%2][col*TSN + WIDTH*row + 0] = vecB.x;\n                    Bsub[tt%2][col*TSN + WIDTH*row + 1] = vecB.y;\n                #elif WIDTH == 4\n                    Bsub[tt%2][col*TSN + WIDTH*row + 0] = vecB.x;\n                    Bsub[tt%2][col*TSN + WIDTH*row + 1] = vecB.y;\n                    Bsub[tt%2][col*TSN + WIDTH*row + 2] = vecB.z;\n                    Bsub[tt%2][col*TSN + WIDTH*row + 3] = vecB.w;\n                #endif\n            }\n        }\n\n        // Loop over the values of a single tile\n        #pragma unroll\n        for (int k=0; k<TSK; k++) {\n\n            // Cache the values of Bsub in registers\n            #pragma unroll\n            for (int wn=0; wn<WPTN; wn++) {\n                int col = tidn + wn*RTSN;\n                Breg[wn] = Bsub[t%2][k*TSN + col];\n            }\n\n            // Perform the computation\n 
           #pragma unroll\n            for (int wm=0; wm<WPTM; wm++) {\n                int row = tidm + wm*RTSM;\n                Areg = Asub[t%2][k*TSM + row];\n                #pragma unroll\n                for (int wn=0; wn<WPTN; wn++) {\n                    acc[wm][wn] += Areg * Breg[wn];\n                }\n            }\n        }\n\n        // Synchronise\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Next tile\n        t++;\n    } while (t<numTiles);\n\n    // Store the final results in C\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        int globalRow = offsetM + tidm + wm*RTSM;\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            int globalCol = offsetN + tidn + wn*RTSN;\n            C[globalCol*M + globalRow] = acc[wm][wn];\n        }\n    }\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 10\n\n#define BK TSK\n#define BN TSN\n#define BM TSM\n#define TX RTSM\n#define TY RTSN\n#define RX WPTM\n#define RY WPTN\n\n// With support for incomplete tiles and arbitrary input/output matrix sizes\n__kernel void myGEMM10(const int M, const int N, const int K,\n                       const __global floatX* A,\n                       const __global floatX* B,\n                       __global float* C) {\n\n    // Thread identifiers\n    const int tidm = get_local_id(0); // Local row ID (max: TSM/WPTM == RTSM)\n    const int tidn = get_local_id(1); // Local col ID (max: TSN/WPTN == RTSN)\n    const int gidm = get_group_id(0); // Work-group ID\n    const int gidn = get_group_id(1); // Work-group ID\n    const int tid = tidn*RTSM + tidm; // Global thread ID (max RTSM*RTSN)\n\n    // Local memory to fit two tiles of A and B\n    __local float Asub[2][TSK*TSM];\n    __local float Bsub[2][TSK*TSN];\n\n    // Allocate register space\n    float Areg;\n    float Breg[WPTN];\n    float acc[WPTM][WPTN];\n\n    // Initialise the accumulation 
registers\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            acc[wm][wn] = 0.0f;\n        }\n    }\n\n    // Tile A\n    #pragma unroll\n    for (int la=0; la<LPTA/WIDTH; la++) {\n        int id = la*RTSN*RTSM + tid;\n        int row = MOD2(id,TSM/WIDTH);\n        int col = DIV2(id,TSM/WIDTH);\n\n        // Load the value (wide vector load)\n        int tiledIndex = TSK*0 + col;\n        int indexA = tiledIndex*(M/WIDTH) + gidm*(TSM/WIDTH) + row;\n        #ifdef USE_LDG\n            floatX vecA = __ldg(&A[indexA]);\n        #else\n            floatX vecA = A[indexA];\n        #endif\n\n        // Store the loaded vector into local memory\n        #if WIDTH == 1\n            Asub[0][col*TSM + row] = vecA;\n        #elif WIDTH == 2\n            Asub[0][col*TSM + WIDTH*row + 0] = vecA.x;\n            Asub[0][col*TSM + WIDTH*row + 1] = vecA.y;\n        #elif WIDTH == 4\n            Asub[0][col*TSM + WIDTH*row + 0] = vecA.x;\n            Asub[0][col*TSM + WIDTH*row + 1] = vecA.y;\n            Asub[0][col*TSM + WIDTH*row + 2] = vecA.z;\n            Asub[0][col*TSM + WIDTH*row + 3] = vecA.w;\n        #endif\n    }\n\n    // Tile B\n    #pragma unroll\n    for (int lb=0; lb<LPTB/WIDTH; lb++) {\n        int id = lb*RTSN*RTSM + tid;\n        int row = MOD2(id,TSN/WIDTH);\n        int col = DIV2(id,TSN/WIDTH);\n\n        // Load the value (wide vector load)\n        int tiledIndex = TSK*0 + col;\n        int indexB = tiledIndex*(N/WIDTH) + gidn*(TSN/WIDTH) + row;\n        #ifdef USE_LDG\n            floatX vecB = __ldg(&B[indexB]);\n        #else\n            floatX vecB = B[indexB];\n        #endif\n\n        // Store the loaded vector into local memory\n        #if WIDTH == 1\n            Bsub[0][col*TSN + row] = vecB;\n        #elif WIDTH == 2\n            Bsub[0][col*TSN + WIDTH*row + 0] = vecB.x;\n            Bsub[0][col*TSN + WIDTH*row + 1] = vecB.y;\n        #elif WIDTH == 4\n           
 Bsub[0][col*TSN + WIDTH*row + 0] = vecB.x;\n            Bsub[0][col*TSN + WIDTH*row + 1] = vecB.y;\n            Bsub[0][col*TSN + WIDTH*row + 2] = vecB.z;\n            Bsub[0][col*TSN + WIDTH*row + 3] = vecB.w;\n        #endif\n    }\n    \n    // Loop over all tiles\n    const int numTiles = K/TSK;\n    int t=0;\n    do {\n\n        // Synchronise\n        barrier(CLK_LOCAL_MEM_FENCE);\n\n        // Load the next tile of A and B into local memory\n        int tt = t + 1;\n        if (tt < numTiles) {\n\n            // Tile A\n            #pragma unroll\n            for (int la=0; la<LPTA/WIDTH; la++) {\n                int id = la*RTSN*RTSM + tid;\n                int row = MOD2(id,TSM/WIDTH);\n                int col = DIV2(id,TSM/WIDTH);\n\n                // Load the value (wide vector load)\n                int tiledIndex = TSK*tt + col;\n                int indexA = tiledIndex*(M/WIDTH) + gidm*(TSM/WIDTH) + row;\n                #ifdef USE_LDG\n                    floatX vecA = __ldg(&A[indexA]);\n                #else\n                    floatX vecA = A[indexA];\n                #endif\n\n                // Store the loaded vector into local memory\n                #if WIDTH == 1\n                    Asub[tt%2][col*TSM + row] = vecA;\n                #elif WIDTH == 2\n                    Asub[tt%2][col*TSM + WIDTH*row + 0] = vecA.x;\n                    Asub[tt%2][col*TSM + WIDTH*row + 1] = vecA.y;\n                #elif WIDTH == 4\n                    Asub[tt%2][col*TSM + WIDTH*row + 0] = vecA.x;\n                    Asub[tt%2][col*TSM + WIDTH*row + 1] = vecA.y;\n                    Asub[tt%2][col*TSM + WIDTH*row + 2] = vecA.z;\n                    Asub[tt%2][col*TSM + WIDTH*row + 3] = vecA.w;\n                #endif\n            }\n\n            // Tile B\n            #pragma unroll\n            for (int lb=0; lb<LPTB/WIDTH; lb++) {\n                int id = lb*RTSN*RTSM + tid;\n                int row = MOD2(id,TSN/WIDTH);\n                int col = 
DIV2(id,TSN/WIDTH);\n\n                // Load the value (wide vector load)\n                int tiledIndex = TSK*tt + col;\n                int indexB = tiledIndex*(N/WIDTH) + gidn*(TSN/WIDTH) + row;\n                #ifdef USE_LDG\n                    floatX vecB = __ldg(&B[indexB]);\n                #else\n                    floatX vecB = B[indexB];\n                #endif\n\n                // Store the loaded vector into local memory\n                #if WIDTH == 1\n                    Bsub[tt%2][col*TSN + row] = vecB;\n                #elif WIDTH == 2\n                    Bsub[tt%2][col*TSN + WIDTH*row + 0] = vecB.x;\n                    Bsub[tt%2][col*TSN + WIDTH*row + 1] = vecB.y;\n                #elif WIDTH == 4\n                    Bsub[tt%2][col*TSN + WIDTH*row + 0] = vecB.x;\n                    Bsub[tt%2][col*TSN + WIDTH*row + 1] = vecB.y;\n                    Bsub[tt%2][col*TSN + WIDTH*row + 2] = vecB.z;\n                    Bsub[tt%2][col*TSN + WIDTH*row + 3] = vecB.w;\n                #endif\n            }\n        }\n\n        // Loop over the values of a single tile\n        #pragma unroll\n        for (int k=0; k<TSK; k++) {\n\n            // Cache the values of Bsub in registers\n            #pragma unroll\n            for (int wn=0; wn<WPTN; wn++) {\n                int col = tidn + wn*RTSN;\n                Breg[wn] = Bsub[t%2][k*TSN + col];\n            }\n\n            // Perform the computation\n            #pragma unroll\n            for (int wm=0; wm<WPTM; wm++) {\n                int row = tidm + wm*RTSM;\n                Areg = Asub[t%2][k*TSM + row];\n                #pragma unroll\n                for (int wn=0; wn<WPTN; wn++) {\n                    acc[wm][wn] += Areg * Breg[wn];\n                }\n            }\n        }\n\n        // Next tile\n        t++;\n    } while (t<numTiles);\n\n    // Store the final results in C\n    #pragma unroll\n    for (int wm=0; wm<WPTM; wm++) {\n        int globalRow = gidm*TSM + tidm + 
wm*RTSM;\n        #pragma unroll\n        for (int wn=0; wn<WPTN; wn++) {\n            int globalCol = gidn*TSN + tidn + wn*RTSN;\n            C[globalCol*M + globalRow] = acc[wm][wn];\n        }\n    }\n}\n\n#endif\n// =================================================================================================\n#if KERNEL == 11\n\n// Typedefs for clBlas-mimic kernel (myGEMM11)\n#if RX == 2\n    typedef float2 floatA;\n    typedef float2 floatC;\n#elif RX == 4\n    typedef float4 floatA;\n    typedef float4 floatC;\n#elif RX == 8\n    typedef float8 floatA;\n    typedef float8 floatC;\n#endif\n#if RK == 2\n    typedef float2 floatB;\n#elif RK == 4\n    typedef float4 floatB;\n#elif RK == 8\n    typedef float8 floatB;\n#endif\n\n// Mimic clBlas (4x8 register tiling with vector data-types)\n__kernel void myGEMM11(const int M, const int N, const int K,\n                       const __global floatA* restrict A,\n                       const __global floatB* restrict B,\n                       __global floatC* C) {\n    \n    // Allocate register space\n    float aReg[RK][RX];\n    float bReg[RY][RK];\n    float acc[RY][RX];\n\n    // Initialise the accumulation registers\n    #pragma unroll\n    for (int y=0; y<RY; y++) {\n        for (int x=0; x<RX; x++) {\n            acc[y][x] = 0.0;\n        }\n    }\n\n    // Loop over all tiles\n    const int numTiles = K/RK;\n    for (int t=0; t<numTiles; t++) {\n\n        // Load a tile of A and B into register memory\n        #pragma unroll\n        for (int y=0; y<RY; y++) {\n\n            // Load the data\n            floatA aVec = A[(RK*t + y)*(M/RX) + get_global_id(0)];\n            floatB bVec = B[(RY*get_global_id(1) + y)*numTiles + t];\n\n            // Store the vector of A into registers\n            #if RX == 2\n                aReg[y][0] = aVec.x;\n                aReg[y][1] = aVec.y;\n            #elif RX == 4\n                aReg[y][0] = aVec.x;\n                aReg[y][1] = aVec.y;\n                
aReg[y][2] = aVec.z;\n                aReg[y][3] = aVec.w;\n            #elif RX == 8\n                aReg[y][0] = aVec.s0;\n                aReg[y][1] = aVec.s1;\n                aReg[y][2] = aVec.s2;\n                aReg[y][3] = aVec.s3;\n                aReg[y][4] = aVec.s4;\n                aReg[y][5] = aVec.s5;\n                aReg[y][6] = aVec.s6;\n                aReg[y][7] = aVec.s7;\n            #endif\n\n            // Store the vector of B into registers\n            #if RK == 2\n                bReg[y][0] = bVec.x;\n                bReg[y][1] = bVec.y;\n            #elif RK == 4\n                bReg[y][0] = bVec.x;\n                bReg[y][1] = bVec.y;\n                bReg[y][2] = bVec.z;\n                bReg[y][3] = bVec.w;\n            #elif RK == 8\n                bReg[y][0] = bVec.s0;\n                bReg[y][1] = bVec.s1;\n                bReg[y][2] = bVec.s2;\n                bReg[y][3] = bVec.s3;\n                bReg[y][4] = bVec.s4;\n                bReg[y][5] = bVec.s5;\n                bReg[y][6] = bVec.s6;\n                bReg[y][7] = bVec.s7;\n            #endif\n        }\n\n        // Perform the computations\n        #pragma unroll\n        for (int k=0; k<RK; k++) {\n            #pragma unroll\n            for (int y=0; y<RY; y++) {\n                #pragma unroll\n                for (int x=0; x<RX; x++) {\n                    acc[y][x] += aReg[k][x] * bReg[y][k];\n                }\n            }\n        }\n    }\n\n    // Store the final results in C\n    #pragma unroll\n    for (int y=0; y<RY; y++) {\n        floatC accVec;\n        #if RX == 2\n            accVec.x = acc[y][0];\n            accVec.y = acc[y][1];\n        #elif RX == 4\n            accVec.x = acc[y][0];\n            accVec.y = acc[y][1];\n            accVec.z = acc[y][2];\n            accVec.w = acc[y][3];\n        #elif RX == 8\n            accVec.s0 = acc[y][0];\n            accVec.s1 = acc[y][1];\n            accVec.s2 = acc[y][2];\n            accVec.s3 
= acc[y][3];\n            accVec.s4 = acc[y][4];\n            accVec.s5 = acc[y][5];\n            accVec.s6 = acc[y][6];\n            accVec.s7 = acc[y][7];\n        #endif\n        C[(y + RY*get_global_id(1)) * (M/RX) + get_global_id(0)] = accVec;\n    }\n}\n\n#endif\n// =================================================================================================\n\n// Simple transpose kernel for a P * Q matrix\n__kernel void transpose(const int P, const int Q,\n                        const __global float* input,\n                        __global float* output) {\n    \n    // Thread identifiers\n    const int tx = get_local_id(0);\n    const int ty = get_local_id(1);\n    const int ID0 = get_group_id(0)*TRANSPOSEX + tx; // 0..P\n    const int ID1 = get_group_id(1)*TRANSPOSEY + ty; // 0..Q\n\n    // Set-up the local memory for shuffling\n    __local float buffer[TRANSPOSEX][TRANSPOSEY];\n\n    // Swap the x and y coordinates to perform the rotation (coalesced)\n    if (ID0 < P && ID1 < Q) {\n        buffer[ty][tx] = input[ID1*P + ID0];\n    }\n\n    // Synchronise all threads\n    barrier(CLK_LOCAL_MEM_FENCE);\n\n    // We don't have to swap the x and y thread indices here,\n    // because that's already done in the local memory\n    const int newID0 = get_group_id(1)*TRANSPOSEY + tx;\n    const int newID1 = get_group_id(0)*TRANSPOSEX + ty;\n\n    // Store the transposed result (coalesced)\n    if (newID0 < Q && newID1 < P) {\n        output[newID1*Q + newID0] = buffer[tx][ty];\n    }\n}\n\n// =================================================================================================\n\n// Pad the P * Q matrix with zeroes to form a P_XL * Q_XL matrix\n__kernel void paddingAddZeroes(const int P, const int Q,\n                               const __global float* input,\n                               const int P_XL, const int Q_XL,\n                               __global float* output) {\n    \n    // Thread identifiers\n    const int tx = 
get_group_id(0)*PADDINGX + get_local_id(0); // 0..P_XL in blocks of PADDINGX\n    const int ty = get_group_id(1)*PADDINGY + get_local_id(1); // 0..Q_XL in blocks of PADDINGY\n\n    // Check whether we are within bounds of the XL matrix\n    if (tx < P_XL && ty < Q_XL) {\n\n        // Copy the input or pad a zero\n        float value;\n        if (tx < P && ty < Q) {\n            value = input[ty*P + tx];\n        }\n        else {\n            value = 0.0f;\n        }\n\n        // Store the result\n        output[ty*P_XL + tx] = value;\n    }\n}\n\n// =================================================================================================\n\n// Remove padded values from a P_XL * Q_XL matrix to form a P * Q matrix\n__kernel void paddingRemoveZeroes(const int P_XL, const int Q_XL,\n                                  const __global float* input,\n                                  const int P, const int Q,\n                                  __global float* output) {\n    \n    // Thread identifiers\n    const int tx = get_group_id(0)*PADDINGX + get_local_id(0); // 0..P in blocks of PADDINGX\n    const int ty = get_group_id(1)*PADDINGY + get_local_id(1); // 0..Q in blocks of PADDINGY\n\n\n    // Only store the result if within P * Q bounds\n    if (tx < P && ty < Q) {\n        output[ty*P + tx] = input[ty*P_XL + tx];\n    }\n}\n\n// =================================================================================================\n"
  },
  {
    "path": "src/libclblas.cpp",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-10\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n\n// Common include\n#include \"common.h\"\n\n// Include OpenCL and clBlas\n#include <clBLAS.h>\n\n// =================================================================================================\n\n// Matrix-multiplication using the clBlas library. This function copies the input matrices to the\n// GPU, runs SGEMM, and copies the output matrix back to the CPU.\nvoid libclblas(float* A, float* B, float* C,\n               int K, int M, int N,\n               int timerID) {\n    cl_int err;\n\n    // Define OpenCL variables\n    cl_platform_id platform = 0;\n    cl_device_id device = 0;\n    cl_device_id devices[MAX_NUM_DEVICES];\n    cl_uint numDevices = 0;\n    cl_context_properties props[3] = {CL_CONTEXT_PLATFORM, 0, 0};\n    cl_context ctx = 0;\n    cl_command_queue queue = 0;\n    cl_event event = NULL;\n    char deviceName[MAX_DEVICE_NAME];\n\n    // Configure the OpenCL environment\n    err = clGetPlatformIDs(1, &platform, NULL);\n    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);\n    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);\n    device = devices[CURRENT_DEVICE];\n    props[1] = (cl_context_properties)platform;\n    ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);\n    queue = clCreateCommandQueue(ctx, device, 0, &err);\n    err = clGetDeviceInfo(device, CL_DEVICE_NAME, MAX_DEVICE_NAME, 
 deviceName, NULL);\n    //printf(\"## %d devices, running on %d: '%s'\\n\", numDevices, CURRENT_DEVICE, deviceName);\n\n    // Configure clBlas\n    err = clblasSetup();\n\n    // Prepare OpenCL memory objects\n    cl_mem bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, M*K*sizeof(*A), NULL, &err);\n    cl_mem bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY, K*N*sizeof(*B), NULL, &err);\n    cl_mem bufC = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M*N*sizeof(*C), NULL, &err);\n\n    // Copy matrices to the GPU (also C to erase the results of the previous run)\n    err = clEnqueueWriteBuffer(queue, bufA, CL_TRUE, 0, M*K*sizeof(*A), A, 0, NULL, NULL);\n    err = clEnqueueWriteBuffer(queue, bufB, CL_TRUE, 0, K*N*sizeof(*B), B, 0, NULL, NULL);\n    err = clEnqueueWriteBuffer(queue, bufC, CL_TRUE, 0, M*N*sizeof(*C), C, 0, NULL, NULL);\n\n    // Run one (small) instance of clBlas first to pre-generate and compile the kernel\n    err = clblasSgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans,\n                      128, 128, 128, ALPHA,\n                      bufA, 0, 128,\n                      bufB, 0, 128, BETA,\n                      bufC, 0, 128,\n                      1, &queue, 0, NULL, &event);\n    err = clWaitForEvents(1, &event);\n    clReleaseEvent(event);\n\n    // Start the timed loop\n    double startTime = timer();\n    for (int r=0; r<NUM_RUNS; r++) {\n\n        // Call clBlas\n        err = clblasSgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans,\n                          M, N, K, ALPHA,\n                          bufA, 0, M,\n                          bufB, 0, K, BETA,\n                          bufC, 0, M,\n                          1, &queue, 0, NULL, &event);\n\n        // Wait for calculations to be finished (then release the event to avoid a per-run leak)\n        err = clWaitForEvents(1, &event);\n        clReleaseEvent(event);\n    }\n\n    // End the timed loop\n    timers[timerID].t += (timer() - startTime) / (double)NUM_RUNS;\n    timers[timerID].kf += ((long)K * (long)M * (long)N * 2) / 1000;\n\n    // Copy the output matrix C back to the CPU memory\n   
 err = clEnqueueReadBuffer(queue, bufC, CL_TRUE, 0, M*N*sizeof(*C), C, 0, NULL, NULL);\n\n    // Free the GPU memory objects\n    clReleaseMemObject(bufA);\n    clReleaseMemObject(bufB);\n    clReleaseMemObject(bufC);\n\n    // Clean-up OpenCL and clBlas \n    clReleaseCommandQueue(queue);\n    clReleaseContext(ctx);\n}\n\n// =================================================================================================\n"
  },
  {
    "path": "src/libcublas.cu",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-10-30\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n\n// Common include\n#include \"common.h\"\n\n// Include CUDA and cuBLAS (API v2)\n#include <cublas_v2.h>\n\n// =================================================================================================\n\n// Matrix-multiplication using the cuBLAS library. This function copies the input matrices to the\n// GPU, runs SGEMM, and copies the output matrix back to the CPU.\nvoid libcublas(float* A, float* B, float* C,\n               int K, int M, int N,\n               int timerID) {\n\n    // cuBLAS configuration\n    cublasStatus_t status;\n    cublasHandle_t handle;\n    status = cublasCreate(&handle);\n\n    // Prepare CUDA memory objects\n    float* bufA = 0;\n    float* bufB = 0;\n    float* bufC = 0;\n    cudaMalloc((void**)&bufA, M*K*sizeof(*A));\n    cudaMalloc((void**)&bufB, K*N*sizeof(*B));\n    cudaMalloc((void**)&bufC, M*N*sizeof(*C));\n\n    // Copy matrices to the GPU (also C to erase the results of the previous run)\n    cudaMemcpy((void*)bufA, (void*)A, M*K*sizeof(*A), cudaMemcpyHostToDevice);\n    cudaMemcpy((void*)bufB, (void*)B, K*N*sizeof(*B), cudaMemcpyHostToDevice);\n    cudaMemcpy((void*)bufC, (void*)C, M*N*sizeof(*C), cudaMemcpyHostToDevice);\n\n    // Configure SGEMM\n    float alpha = ALPHA;\n    float beta = BETA;\n\n    // Start the timed loop\n    double startTime = timer();\n    for (int r=0; r<NUM_RUNS; r++) {\n\n        // Call cuBLAS\n        
status = cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,\n                             M, N, K, &alpha,\n                             bufA, M,\n                             bufB, K, &beta,\n                             bufC, M);\n\n        // Wait for calculations to be finished\n        cudaDeviceSynchronize();\n    }\n\n    // End the timed loop\n    timers[timerID].t += (timer() - startTime) / (double)NUM_RUNS;\n    timers[timerID].kf += ((long)K * (long)M * (long)N * 2) / 1000;\n\n    // Copy the output matrix C back to the CPU memory\n    cudaMemcpy((void*)C, (void*)bufC, M*N*sizeof(*C), cudaMemcpyDeviceToHost);\n\n    // Free the GPU memory objects\n    cudaFree(bufA);\n    cudaFree(bufB);\n    cudaFree(bufC);\n\n    // Clean-up cuBLAS\n    status = cublasDestroy(handle);\n    if (status != CUBLAS_STATUS_SUCCESS) {\n        exit(1);\n    }\n}\n\n// =================================================================================================\n"
  },
  {
    "path": "src/main.cpp",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-10\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n\n// Common include\n#include \"common.h\"\n\n// Global variable with timing results\nprofile_t timers[NUM_TIMERS];\n\n// =================================================================================================\n\n// Main function. This takes care of creating matrices of various sizes and iterating over the\n// different types of BLAS libraries. It also computes the error rate in terms of the L2-norm with\n// respect to cuBLAS (the 'golden' reference).\nint main(int argc, char* argv[]) {\n\n    // Start of the function\n    printf(\"\\n##\\n\");\n    srand(time(NULL));\n\n    // Compute the peak performance of the GPU\n    double peak = GPU_CLOCK * GPU_CORES * GPU_MOD;\n\n    // Print information about the different configurations\n    printf(\"## --- Configurations ---\\n\");\n    for (int c=0; c<=3; c++) {\n        #ifndef ENABLE_CUDA\n            if (c == 0 || c == 2) { continue; }\n        #endif\n        switch(c) {\n            case 0: printf(\"##    cuBLAS on '%s', peak: %.1lf GFLOPS\\n\", GPU_NAME, peak); break;\n            case 1: printf(\"##    clBlas on '%s', peak: %.1lf GFLOPS\\n\", GPU_NAME, peak); break;\n            case 2: printf(\"## myGEMM.cu on '%s', peak: %.1lf GFLOPS\\n\", GPU_NAME, peak); break;\n            case 3: printf(\"## myGEMM.cl on '%s', peak: %.1lf GFLOPS\\n\", GPU_NAME, peak); break;\n        }\n    }\n\n    // Loop over the different 
input/output matrix sizes\n    for (int size=MINSIZE; size<=MAXSIZE; size=size*2) {\n\n        // Set the performance counters to zero\n        for (int t=0; t<NUM_TIMERS; t++) {\n            timers[t].t = 0.0;\n            timers[t].kf = 0;\n        }\n\n        // Set the matrices to be square (change this to get rectangular matrices)\n        const int k = size;\n        const int m = size;\n        const int n = size;\n        printf(\"##\\n\");\n        printf(\"## --- %dx%dx%d ---\\n\", k, m, n);\n\n        // Allocate memory for the matrices and fill the inputs with random numbers\n        float* A = (float*)malloc(m*k*sizeof(float));\n        float* B = (float*)malloc(k*n*sizeof(float));\n        float* C = (float*)malloc(m*n*sizeof(float));\n        float* goldC = (float*)malloc(MAXSIZE*MAXSIZE*sizeof(float));\n        for (int i=0; i<m*k; i++) {\n            A[i] = (float)rand() / (float)RAND_MAX;\n        }\n        for (int i=0; i<k*n; i++) {\n            B[i] = (float)rand() / (float)RAND_MAX;\n        }\n\n        // Run cuBLAS or clBlas first to get the 'golden' reference output\n        #ifdef ENABLE_CUDA\n            libcublas(A, B, goldC, k, m, n, NUM_TIMERS-1);\n        #else\n            libclblas(A, B, goldC, k, m, n, NUM_TIMERS-1);\n        #endif\n\n        // Loop over the configurations\n        for (int c=0; c<=3; c++) {\n\n            // Skip configurations if CUDA is disabled\n            #ifndef ENABLE_CUDA\n                if (c == 0 || c == 2) { continue; }\n            #endif\n\n            // Set the output matrix to zero (to erase the results of the previous run)\n            for (int i=0; i<m*n; i++) {\n                C[i] = 0.0f;\n            }\n\n            // Get the name of the configuration\n            char name[100];\n            switch(c) {\n                case 0: sprintf(name, \"cuBLAS\"); break;\n                case 1: sprintf(name, \"clBlas\"); break;\n                case 2: sprintf(name, \"myGEMM.cu\"); 
break;\n                case 3: sprintf(name, \"myGEMM.cl\"); break;\n            }\n\n            // Perform the matrix-multiplication\n            switch(c) {\n                #ifdef ENABLE_CUDA\n                    case 0: libcublas(A, B, C, k, m, n, c); break;\n                #endif\n                case 1: libclblas(A, B, C, k, m, n, c); break;\n                #ifdef ENABLE_CUDA\n                    case 2: mycublas(A, B, C, k, m, n, c); break;\n                #endif\n                case 3: myclblas(A, B, C, k, m, n, c); break;\n            }\n\n            // Compare the result to the 'golden' reference output in terms of the L2-norm\n            double L2norm = 0.0;\n            for (int i=0; i<m*n; i++) {\n                double val = C[i] - goldC[i];\n                L2norm += val*val;\n            }\n            L2norm = sqrt(L2norm);\n\n            // Print the results to screen\n            double seconds = wtime(timers[c]);\n            double performance = gflops(timers[c]);\n            double fraction = 100.0 * performance / peak;\n            printf(\"## [%9s] %6.3lf s --> %6.1lf GFLOPS (%2.0lf%%), L2 norm: %.2e\\n\",\n                   name, seconds, performance, fraction, L2norm);\n        }\n\n        // Free up the matrices\n        free(A);\n        free(B);\n        free(C);\n        free(goldC);\n    }\n\n    // End of the program\n    printf(\"##\\n\");\n    printf(\"\\n\");\n    return 0;\n}\n\n// =================================================================================================\n\n// Timer function: Measure the current time\ndouble timer(void) {\n    struct timeval Tvalue;\n    struct timezone dummy;\n    gettimeofday(&Tvalue, &dummy);\n    double etime = (double)Tvalue.tv_sec + 1.0e-6*((double)Tvalue.tv_usec);\n    return etime;\n    //return omp_get_wtime();\n}\n\n// Timer function: Get the execution time\ndouble wtime(profile_t timer) {\n    return (timer.t);\n}\n\n// Timer function: Get the GFLOPS number\ndouble 
gflops(profile_t timer) {\n    return ((double)timer.kf/(1000.0*1000.0)) / (timer.t);\n}\n\n// =================================================================================================\n\n// Load an OpenCL kernel from file\nchar* readKernelFile(const char* filename, long* _size) {\n\n    // Open the file\n    FILE* file = fopen(filename, \"r\");\n    if (!file) {\n        printf(\"-- Error opening file %s\\n\", filename);\n        exit(1);\n    }\n\n    // Get its size\n    fseek(file, 0, SEEK_END);\n    long size = ftell(file);\n    rewind(file);\n\n    // Read the kernel code as a string (and verify that the whole file was read)\n    char* source = (char *)malloc((size+1)*sizeof(char));\n    size_t numRead = fread(source, 1, size*sizeof(char), file);\n    if (numRead != (size_t)size) {\n        printf(\"-- Error reading file %s\\n\", filename);\n        exit(1);\n    }\n    source[size] = '\\0';\n    fclose(file);\n\n    // Save the size and return the source string\n    *_size = (size+1);\n    return source;\n}\n\n// =================================================================================================\n"
  },
  {
    "path": "src/settings.h",
    "content": "\n// =================================================================================================\n// Project: \n// Exploring the performance of general matrix-multiplication on an NVIDIA Tesla K40m GPU.\n//\n// File information:\n// Institution.... SURFsara <www.surfsara.nl>\n// Author......... Cedric Nugteren <cedric.nugteren@surfsara.nl>\n// Changed at..... 2014-11-07\n// License........ MIT license\n// Tab-size....... 4 spaces\n// Line length.... 100 characters\n//\n// =================================================================================================\n\n// Select a kernel\n#define KERNEL 8\n\n// Constants for kernels 1 -- 5\n#define TS 32                        // The square-root of the 2D tile-size (== work-group dims)\n\n// Constants for kernels 3, 5\n#define WPT 8                        // The amount of work-per-thread, i.e. the thread-coarsening factor\n#define RTS (TS/WPT)                 // The reduced tile-size in one dimension\n\n// Constants for kernels 4, 7 -- 10\n#define WIDTH 4                      // The vector-width (in number of floats)\n\n// Constants for kernel 5\n#define TSDK 16                      // The tile-size in dimension K (for kernel 5 only)\n#define LPT ((TSDK*WPT)/(TS))        // The amount of loads-per-thread (assume TSN==TSM)\n\n// Constants for kernels 6 -- 10\n#define TSM 128                      // The tile-size in dimension M\n#define TSN 128                      // The tile-size in dimension N\n#define TSK 16                       // The tile-size in dimension K\n#define WPTM 8                       // The amount of work-per-thread in dimension M\n#define WPTN 8                       // The amount of work-per-thread in dimension N\n#define RTSM (TSM/WPTM)              // The reduced tile-size in dimension M (== number of threads)\n#define RTSN (TSN/WPTN)              // The reduced tile-size in dimension N (== number of threads)\n#define LPTA ((TSK*WPTM*WPTN)/(TSN)) // The amount of 
loads-per-thread for A\n#define LPTB ((TSK*WPTM*WPTN)/(TSM)) // The amount of loads-per-thread for B\n\n// Constraints on settings for kernels 6 -- 10\n// Note: TSM/WPTM has to be an integer\n// Note: TSN/WPTN has to be an integer\n// Note: TSM/WIDTH has to be an integer\n// Note: TSN/WIDTH has to be an integer\n// Note: (TSK*WPTM*WPTN)/(TSN*WIDTH) has to be an integer\n// Note: (TSK*WPTM*WPTN)/(TSM*WIDTH) has to be an integer\n\n// Constants for kernel 11 (mimicking clBlas)\n#define THREADSX 8\n#define THREADSY 8\n#define RX 8\n#define RY 4\n#define RK (RY)\n\n// Constants for the supporting transpose kernel\n#define TRANSPOSEX 16\n#define TRANSPOSEY 16\n\n// Constants for the supporting padding kernels\n#define PADDINGX 16\n#define PADDINGY 16\n\n// Macros for host and kernel code (fully parenthesised to be safe in any expression)\n#define MIN(a,b) (((a) > (b)) ? (b) : (a))\n#define MAX(a,b) (((a) > (b)) ? (a) : (b))\n#define CEIL_DIV(x,y) (((x) + (y) - 1) / (y))\n#define MOD2(x,y) ((x) % (y))\n#define DIV2(x,y) ((x) / (y))\n\n// =================================================================================================\n"
  }
]