[
  {
    "path": "LICENSE.txt",
    "content": "MIT License\n\nCopyright (c) Minqin Chen\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of\nthis software and associated documentation files (the \"Software\"), to deal in\nthe Software without restriction, including without limitation the rights to\nuse, copy, modify, merge, publish, distribute, sublicense, and/or sell copies\nof the Software, and to permit persons to whom the Software is furnished to do\nso, subject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR\nIMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,\nFITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE\nAUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER\nLIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,\nOUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE\nSOFTWARE.\n"
  },
  {
    "path": "MANIFEST.in",
    "content": ""
  },
  {
    "path": "Makefile",
    "content": "# Uncomment for debugging\n# DEBUG := 1\n# Pretty build\n# Q ?= @\n\nCXX := g++\npython := python3\nPYTHON_HEADER_DIR := $(shell python -c 'from distutils.sysconfig import get_python_inc; print(get_python_inc())')\nPYTORCH_INCLUDES := $(shell python -c 'from torch.utils.cpp_extension import include_paths; [print(p) for p in include_paths()]')\nPYTORCH_LIBRARIES := $(shell python -c 'from torch.utils.cpp_extension import library_paths; [print(p) for p in library_paths()]')\n\nCUDA_DIR := $(shell python -c 'from torch.utils.cpp_extension import _find_cuda_home; print(_find_cuda_home())')\nWITH_ABI := $(shell python -c 'import torch; print(int(torch._C._GLIBCXX_USE_CXX11_ABI))')\nINCLUDE_DIRS := ./ $(CUDA_DIR)/include\nINCLUDE_DIRS += $(PYTHON_HEADER_DIR)\nINCLUDE_DIRS += $(PYTORCH_INCLUDES)\n\n# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.\n# BLAS_INCLUDE := /path/to/your/blas\n# BLAS_LIB := /path/to/your/blas\n\nSRC_DIR := ./src\nOBJ_DIR := ./obj\nCPP_SRCS := $(wildcard $(SRC_DIR)/*.cpp)\nCU_SRCS := $(wildcard $(SRC_DIR)/*.cu)\nOBJS := $(patsubst $(SRC_DIR)/%.cpp,$(OBJ_DIR)/%.o,$(CPP_SRCS))\nCU_OBJS := $(patsubst $(SRC_DIR)/%.cu,$(OBJ_DIR)/cuda/%.o,$(CU_SRCS))\nSTATIC_LIB := $(OBJ_DIR)/libquant_impl.a\n\n\nCUDA_ARCH := -gencode arch=compute_50,code=sm_50 \\\n\t\t-gencode arch=compute_52,code=sm_52 \\\n\t\t-gencode arch=compute_60,code=sm_60 \\\n\t\t-gencode arch=compute_61,code=sm_61 \\\n\t\t-gencode arch=compute_70,code=sm_70 \\\n\t\t-gencode arch=compute_75,code=sm_75 \\\n\t\t-gencode arch=compute_75,code=compute_75\n\n\nLIBRARIES += stdc++ cudart c10 caffe2 torch torch_python caffe2_gpu\n\n\nifeq ($(DEBUG), 1)\n\tCOMMON_FLAGS += -DDEBUG -g -O0\n\tNVCCFLAGS += -g -G # -rdc true\nelse\n\tCOMMON_FLAGS += -DNDEBUG -O3\nendif\n\nWARNINGS := -Wall -Wno-sign-compare -Wcomment\nINCLUDE_DIRS += $(BLAS_INCLUDE)\nCXXFLAGS += -MMD -MP\nCOMMON_FLAGS += $(foreach includedir,$(INCLUDE_DIRS),-I$(includedir)) \\\n\t     -DTORCH_API_INCLUDE_EXTENSION_H -D_GLIBCXX_USE_CXX11_ABI=$(WITH_ABI)\nCXXFLAGS += -pthread -fPIC -fwrapv -std=c++14 $(COMMON_FLAGS) $(WARNINGS)\nNVCCFLAGS += -std=c++14 -ccbin=$(CXX) -Xcompiler -fPIC -use_fast_math $(COMMON_FLAGS)\n\ndefault: $(STATIC_LIB)\n\n$(OBJ_DIR):\n\t@ mkdir -p $@\n\t@ mkdir -p $@/cuda\n\n$(OBJ_DIR)/%.o: $(SRC_DIR)/%.cpp | $(OBJ_DIR)\n\t@ echo CXX $<\n\t$(Q)$(CXX) $< $(CXXFLAGS) -c -o $@\n\n$(OBJ_DIR)/cuda/%.o: $(SRC_DIR)/%.cu | $(OBJ_DIR)\n\t@ echo NVCC $<\n\t$(Q)nvcc $(NVCCFLAGS) $(CUDA_ARCH) -M $< -o ${@:.o=.d} \\\n\t\t-odir $(@D)\n\t$(Q)nvcc $(NVCCFLAGS) $(CUDA_ARCH) -c $< -o $@\n\n$(STATIC_LIB): $(OBJS) $(CU_OBJS) | $(OBJ_DIR)\n\t$(RM) -f $(STATIC_LIB)\n\t$(RM) -rf build dist\n\t@ echo LD -o $@\n\tar rc $(STATIC_LIB) $(OBJS) $(CU_OBJS)\n\nbuild:\n\t$(python) setup.py build\n\nupload:\n\t$(python) setup.py sdist bdist_wheel\n\t#twine upload dist/*\n\nclean:\n\t$(RM) -rf build dist nnieqat.egg-info\n\ntest:\n\tnosetests -s tests/test_quant_impl.py --nologcapture\n\tnosetests -s tests/test_merge_freeze_bn.py --nologcapture\n\nlint:\n\tpylint nnieqat --reports=n\n\nlintfull:\n\tpylint nnieqat\n\ninstall:\n\t$(python) setup.py install \n\nuninstall:\n\t$(python) setup.py install --record install.log\n\tcat install.log | xargs rm -rf \n\t$(RM) install.log\n"
  },
  {
    "path": "README.md",
    "content": "# nnieqat-pytorch\n\nNnieqat is a quantize aware training package for  Neural Network Inference Engine(NNIE) on pytorch, it uses hisilicon quantization library to quantize module's weight and activation as fake fp32 format.\n\n\n## Table of Contents\n\n- [nnieqat-pytorch](#nnieqat-pytorch)\n  - [Table of Contents](#table-of-contents)\n  - [Installation](#installation)\n  - [Usage](#usage)\n  - [Code Examples](#code-examples)\n  - [Results](#results)\n  - [Todo](#todo)\n  - [Reference](#reference)\n\n\n<div id=\"installation\"></div>  \n\n## Installation\n\n* Supported Platforms: Linux\n* Accelerators and GPUs: NVIDIA GPUs via CUDA driver ***10.1*** or ***10.2***.\n* Dependencies:\n  * python >= 3.5, < 4\n  * llvmlite >= 0.31.0\n  * pytorch >= 1.5\n  * numba >= 0.42.0\n  * numpy >= 1.18.1\n* Install nnieqat via pypi:  \n  ```shell\n  $ pip install nnieqat\n  ```\n\n* Install nnieqat in docker(easy way to solve environment problems)：\n  ```shell\n  $ cd docker\n  $ docker build -t nnieqat-image .\n\n  ```\n* Install nnieqat via repo：\n  ```shell\n  $ git clone https://github.com/aovoc/nnieqat-pytorch\n  $ cd nnieqat-pytorch\n  $ make install\n  ```\n\n<div id=\"usage\"></div>\n\n## Usage\n\n* add quantization hook.\n\n  quantize and dequantize weight and data with HiSVP GFPQ library in forward() process.\n\n  ```python\n\n  from nnieqat import quant_dequant_weight, unquant_weight, merge_freeze_bn, register_quantization_hook\n  ...\n  ...\n    register_quantization_hook(model)\n  ...\n  ```\n\n* merge bn weight into conv and freeze bn\n\n  suggest finetuning from a well-trained model, merge_freeze_bn at beginning. do it after a few epochs of training otherwise.\n\n  ```python\n  from nnieqat import quant_dequant_weight, unquant_weight, merge_freeze_bn, register_quantization_hook\n  ...\n  ...\n      model.train()\n      model = merge_freeze_bn(model)  #it will change bn to eval() mode during training\n  ...\n  ```\n\n* Unquantize weight before update it\n\n  ```python\n  from nnieqat import quant_dequant_weight, unquant_weight, merge_freeze_bn, register_quantization_hook\n  ...\n  ...\n      model.apply(unquant_weight)  # using original weight while updating\n      optimizer.step()\n  ...\n  ```\n\n* Dump weight optimized model\n\n  ```python\n  from nnieqat import quant_dequant_weight, unquant_weight, merge_freeze_bn, register_quantization_hook\n  ...\n  ...\n      model.apply(quant_dequant_weight)\n      save_checkpoint(...)\n      model.apply(unquant_weight)\n  ...\n  ```\n\n* Using EMA with caution(Not recommended).\n\n<div id=\"examples\"></div>\n\n## Code Examples\n\n* [Cifar10 quantization aware training example][cifar10_qat]  (add nnieqat into [pytorch_cifar10_tutorial][cifar10_example])\n\n  ```python test/test_cifar10.py```\n\n* [ImageNet quantization finetuning example][imagenet_qat]  (add nnieqat into [pytorh_imagenet_main.py][imagenet_example])\n\n  ```python test/test_imagenet.py  --pretrained  path_to_imagenet_dataset```\n\n<div id=\"results\"></div>\n\n## Results  \n\n* ImageNet\n\n  ```\n  python test/test_imagenet.py /data/imgnet/ --arch squeezenet1_1  --lr 0.001 --pretrained --epoch 10   # nnie_lr_e-3_ft\n  python pytorh_imagenet_main.py /data/imgnet/ --arch squeezenet1_1  --lr 0.0001 --pretrained --epoch 10  # lr_e-4_ft\n  python test/test_imagenet.py /data/imgnet/ --arch squeezenet1_1  --lr 0.0001 --pretrained --epoch 10  # nnie_lr_e-4_ft\n  ```\n\n  finetune result：\n\n    |     | trt_fp32 | trt_int8     | nnie     |\n    | -------- |  -------- | -------- 
| -------- |\n    | torchvision     | 0.56992  | 0.56424  | 0.56026 |\n    | nnie_lr_e-3_ft | 0.56600   | 0.56328   | 0.56612 |\n    | lr_e-4_ft  | 0.57884   | 0.57502   | 0.57542 |\n    | nnie_lr_e-4_ft | 0.57834   | 0.57524   | 0.57730 |  \n\n\n* coco\n\nnet: simplified  yolov5s\n\ntrain 300 epoches, hi3559 test result:   \n\n Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.338   \n Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.540   \n Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.357   \n Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.187   \n Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.377   \n Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.445   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.284   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.484   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.542   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.357   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.595   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.679   \n\n\nfinetune 20 epoches, hi3559 test result:   \n\n Average Precision  (AP) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.339   \n Average Precision  (AP) @[ IoU=0.50      | area=   all | maxDets=100 ] = 0.539   \n Average Precision  (AP) @[ IoU=0.75      | area=   all | maxDets=100 ] = 0.360   \n Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.191   \n Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.378   \n Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.446   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=  1 ] = 0.285   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets= 10 ] = 0.485   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area=   all | maxDets=100 ] = 0.544   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.361   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.596   \n Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.683   \n\n\n\n<div id=\"Todo\"></div>\n\n## Todo\n\n* Generate quantized model directly.\n\n<div id=\"reference\"></div>  \n\n## Reference\n\nHiSVP 量化库使用指南\n\n[Quantizing deep convolutional networks for efficient inference: A whitepaper][quant_whitepaper]\n\n[8-bit Inference with TensorRT][trt_quant]\n\n[Distilling the Knowledge in a Neural Network][distillingNN]\n\n[cifar10_qat]: https://github.com/aovoc/nnieqat-pytorch/blob/master/test/test_cifar10.py\n\n[imagenet_qat]: https://github.com/aovoc/nnieqat-pytorch/blob/master/test/test_imagenet.py\n\n[imagenet_example]: https://github.com/pytorch/examples/blob/master/imagenet/main.py\n\n[cifar10_example]: https://pytorch.org/tutorials/beginner/blitz/cifar10_tutorial.html\n\n[quant_whitepaper]: https://arxiv.org/abs/1806.08342\n\n[trt_quant]: https://on-demand.gputechconf.com/gtc/2017/presentation/s7310-8-bit-inference-with-tensorrt.pdf\n\n[distillingNN]: https://arxiv.org/abs/1503.02531\n"
  },
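  {
    "path": "examples/qat_workflow.py",
    "content": "\"\"\"A minimal sketch of the QAT workflow described in README.md.\n\nThe tiny model, random data, and two-step schedule below are illustrative\nassumptions, not part of the package; only the nnieqat calls mirror the\ndocumented API. Like the package itself, this sketch needs a CUDA-capable GPU.\n\"\"\"\nimport torch\nimport torch.nn as nn\nfrom nnieqat import (merge_freeze_bn, quant_dequant_weight,\n                     register_quantization_hook, unquant_weight)\n\n\ndef main():\n    # Placeholder network: conv + BN so merge_freeze_bn has something to fuse.\n    model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8),\n                          nn.ReLU(), nn.AdaptiveAvgPool2d(1), nn.Flatten(),\n                          nn.Linear(8, 10))\n    register_quantization_hook(model)  # 1. fake-quantize weights/activations in forward()\n    model.cuda().train()\n    model = merge_freeze_bn(model)     # 2. fold BN into conv and freeze it (well-trained model assumed)\n    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)\n    criterion = nn.CrossEntropyLoss()\n\n    for _ in range(2):  # placeholder schedule with random data\n        inputs = torch.randn(4, 3, 32, 32).cuda()\n        targets = torch.randint(0, 10, (4,)).cuda()\n        loss = criterion(model(inputs), targets)\n        optimizer.zero_grad()\n        loss.backward()\n        model.apply(unquant_weight)    # 3. update the original fp32 weights\n        optimizer.step()\n\n    model.apply(quant_dequant_weight)  # 4. dump a checkpoint with quantized weights\n    torch.save(model.state_dict(), \"qat_model.pth\")\n    model.apply(unquant_weight)\n\n\nif __name__ == \"__main__\":\n    main()\n"
  },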
  {
    "path": "build_helper.py",
    "content": "import os\nimport shutil\nimport subprocess\nimport sys\nimport tempfile\nfrom distutils import ccompiler\n\n\ndef print_warning(*lines):\n    print('**************************************************')\n    for line in lines:\n        print('*** WARNING: %s' % line)\n    print('**************************************************')\n\n\ndef get_path(key):\n    return os.environ.get(key, '').split(os.pathsep)\n\n\ndef search_on_path(filenames):\n    for p in get_path('PATH'):\n        for filename in filenames:\n            full = os.path.join(p, filename)\n            if os.path.exists(full):\n                return os.path.abspath(full)\n    return None\n\n\nminimum_cuda_version = 10010\nmaxinum_cuda_version = 10030\nminimum_cudnn_version = 7000\n\n\ndef get_compiler_setting():\n    nvcc_path = search_on_path(('nvcc', 'nvcc.exe'))\n    cuda_path_default = None\n    if nvcc_path is None:\n        print_warning('nvcc not in path.', 'Please set path to nvcc.')\n    else:\n        cuda_path_default = os.path.normpath(\n            os.path.join(os.path.dirname(nvcc_path), '..'))\n\n    cuda_path = os.environ.get('CUDA_PATH', '')  # Nvidia default on Windows\n    if len(cuda_path) > 0 and cuda_path != cuda_path_default:\n        print_warning('nvcc path != CUDA_PATH',\n                      'nvcc path: %s' % cuda_path_default,\n                      'CUDA_PATH: %s' % cuda_path)\n\n    if not os.path.exists(cuda_path):\n        cuda_path = cuda_path_default\n\n    if not cuda_path and os.path.exists('/usr/local/cuda'):\n        cuda_path = '/usr/local/cuda'\n\n    include_dirs = []\n    library_dirs = []\n    define_macros = []\n\n    if cuda_path:\n        include_dirs.append(os.path.join(cuda_path, 'include'))\n        if sys.platform == 'win32':\n            library_dirs.append(os.path.join(cuda_path, 'bin'))\n            library_dirs.append(os.path.join(cuda_path, 'lib', 'x64'))\n        else:\n            library_dirs.append(os.path.join(cuda_path, 'lib64'))\n            library_dirs.append(os.path.join(cuda_path, 'lib'))\n    if sys.platform == 'darwin':\n        library_dirs.append('/usr/local/cuda/lib')\n\n    return {\n        'include_dirs': include_dirs,\n        'library_dirs': library_dirs,\n        'define_macros': define_macros,\n        'language': 'c++',\n    }\n\n\ndef check_cuda_version():\n    compiler = ccompiler.new_compiler()\n    settings = get_compiler_setting()\n    try:\n        out = build_and_run(compiler,\n                            '''\n        #include <cuda.h>\n        #include <stdio.h>\n        int main(int argc, char* argv[]) {\n          printf(\"%d\", CUDA_VERSION);\n          return 0;\n        }\n        ''',\n                            include_dirs=settings['include_dirs'])\n\n    except Exception as e:\n        print_warning('Cannot check CUDA version', str(e))\n        return False\n\n    cuda_version = int(out)\n    if cuda_version < minimum_cuda_version:\n        print_warning('CUDA version is too old: %d' % cuda_version,\n                      'CUDA v10.1 or CUDA v10.2 is required')\n        return False\n    if cuda_version > maxinum_cuda_version:\n        print_warning('CUDA version is too new: %d' % cuda_version,\n                      'CUDA v10.1 or CUDA v10.2 is required')\n\n    return True\n\n\ndef check_cudnn_version():\n    compiler = ccompiler.new_compiler()\n    settings = get_compiler_setting()\n    try:\n        out = build_and_run(compiler,\n                            '''\n        #include <cudnn.h>\n        #include 
<stdio.h>\n        int main(int argc, char* argv[]) {\n          printf(\"%d\", CUDNN_VERSION);\n          return 0;\n        }\n        ''',\n                            include_dirs=settings['include_dirs'])\n\n    except Exception as e:\n        print_warning('Cannot check cuDNN version\\n{0}'.format(e))\n        return False\n\n    cudnn_version = int(out)\n    if cudnn_version < minimum_cudnn_version:\n        print_warning('cuDNN version is too old: %d' % cudnn_version,\n                      'cuDNN v7 or newer is required')\n        return False\n\n    return True\n\n\ndef build_and_run(compiler,\n                  source,\n                  libraries=(),\n                  include_dirs=(),\n                  library_dirs=()):\n    temp_dir = tempfile.mkdtemp()\n\n    try:\n        fname = os.path.join(temp_dir, 'a.cpp')\n        with open(fname, 'w') as f:\n            f.write(source)\n\n        objects = compiler.compile([fname],\n                                   output_dir=temp_dir,\n                                   include_dirs=include_dirs)\n\n        try:\n            postargs = ['/MANIFEST'] if sys.platform == 'win32' else []\n            compiler.link_executable(objects,\n                                     os.path.join(temp_dir, 'a'),\n                                     libraries=libraries,\n                                     library_dirs=library_dirs,\n                                     extra_postargs=postargs,\n                                     target_lang='c++')\n        except Exception as e:\n            msg = 'Cannot build a stub file.\\nOriginal error: {0}'.format(e)\n            raise Exception(msg)\n\n        try:\n            out = subprocess.check_output(os.path.join(temp_dir, 'a'))\n            return out\n\n        except Exception as e:\n            msg = 'Cannot execute a stub file.\\nOriginal error: {0}'.format(e)\n            raise Exception(msg)\n\n    finally:\n        shutil.rmtree(temp_dir, ignore_errors=True)\n"
  },
  {
    "path": "docker/Dockerfile",
    "content": "ARG PYTORCH=\"1.6.0\"\nARG CUDA=\"10.1\"\nARG CUDNN=\"7\"\n\nFROM pytorch/pytorch:${PYTORCH}-cuda${CUDA}-cudnn${CUDNN}-devel\n\nENV TORCH_CUDA_ARCH_LIST=\"6.0 6.1 7.0+PTX\"\nENV TORCH_NVCC_FLAGS=\"-Xfatbin -compress-all\"\nENV CMAKE_PREFIX_PATH=\"$(dirname $(which conda))/../\"\n\nRUN apt-get update && apt-get install -y git ninja-build libglib2.0-0 libsm6 libxrender-dev libxext6 \\\n    && apt-get clean \\\n    && rm -rf /var/lib/apt/lists/*\n\n# Install nnieqat\nRUN pip install nnieqat\n\nWORKDIR /root/\n"
  },
  {
    "path": "docs/Makefile",
    "content": "# Minimal makefile for Sphinx documentation\n#\n\n# You can set these variables from the command line, and also\n# from the environment for the first two.\nSPHINXOPTS    ?=\nSPHINXBUILD   ?= sphinx-build\nSOURCEDIR     = source\nBUILDDIR      = build\n\n# Put it first so that \"make\" without argument is like \"make help\".\nhelp:\n\t@$(SPHINXBUILD) -M help \"$(SOURCEDIR)\" \"$(BUILDDIR)\" $(SPHINXOPTS) $(O)\n\n.PHONY: help Makefile\n\n# Catch-all target: route all unknown targets to Sphinx using the new\n# \"make mode\" option.  $(O) is meant as a shortcut for $(SPHINXOPTS).\n%: Makefile\n\t@$(SPHINXBUILD) -M $@ \"$(SOURCEDIR)\" \"$(BUILDDIR)\" $(SPHINXOPTS) $(O)\n"
  },
  {
    "path": "docs/make.bat",
    "content": "@ECHO OFF\r\n\r\npushd %~dp0\r\n\r\nREM Command file for Sphinx documentation\r\n\r\nif \"%SPHINXBUILD%\" == \"\" (\r\n\tset SPHINXBUILD=sphinx-build\r\n)\r\nset SOURCEDIR=source\r\nset BUILDDIR=build\r\n\r\nif \"%1\" == \"\" goto help\r\n\r\n%SPHINXBUILD% >NUL 2>NUL\r\nif errorlevel 9009 (\r\n\techo.\r\n\techo.The 'sphinx-build' command was not found. Make sure you have Sphinx\r\n\techo.installed, then set the SPHINXBUILD environment variable to point\r\n\techo.to the full path of the 'sphinx-build' executable. Alternatively you\r\n\techo.may add the Sphinx directory to PATH.\r\n\techo.\r\n\techo.If you don't have Sphinx installed, grab it from\r\n\techo.http://sphinx-doc.org/\r\n\texit /b 1\r\n)\r\n\r\n%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%\r\ngoto end\r\n\r\n:help\r\n%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS% %O%\r\n\r\n:end\r\npopd\r\n"
  },
  {
    "path": "docs/source/build_helper.rst",
    "content": "build\\_helper module\n====================\n\n.. automodule:: build_helper\n   :members:\n   :undoc-members:\n   :show-inheritance:\n"
  },
  {
    "path": "docs/source/conf.py",
    "content": "# -*- coding: utf-8 -*-\n#\nimport os\nimport sys\nsys.path.insert(0, os.path.abspath('./../../'))\n\n\n# -- Project information -----------------------------------------------------\n\nproject = 'nnieqat'\ncopyright = '2020, Minqin Chen'\nauthor = 'Minqin Chen'\n\n# The short X.Y version\nversion = ''\n# The full version, including alpha/beta/rc tags\nrelease = '0.1.0'\n\n\n# -- General configuration ---------------------------------------------------\n\n# If your documentation needs a minimal Sphinx version, state it here.\n#\n# needs_sphinx = '1.0'\n\n# Add any Sphinx extension module names here, as strings. They can be\n# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom\n# ones.\nextensions = [\n    'sphinx.ext.todo',\n    'sphinx.ext.githubpages',\n    'sphinx.ext.autodoc',\n]\n\n# Add any paths that contain templates here, relative to this directory.\ntemplates_path = ['_templates']\n\n# The suffix(es) of source filenames.\n# You can specify multiple suffix as a list of string:\n#\n# source_suffix = ['.rst', '.md']\nsource_suffix = '.rst'\n\n# The master toctree document.\nmaster_doc = 'index'\n\n# The language for content autogenerated by Sphinx. Refer to documentation\n# for a list of supported languages.\n#\n# This is also used if you do content translation via gettext catalogs.\n# Usually you set \"language\" from the command line for these cases.\nlanguage = None\n\n# List of patterns, relative to source directory, that match files and\n# directories to ignore when looking for source files.\n# This pattern also affects html_static_path and html_extra_path .\nexclude_patterns = ['_build', 'Thumbs.db', '.DS_Store']\n\n# The name of the Pygments (syntax highlighting) style to use.\npygments_style = 'sphinx'\n\n\n# -- Options for HTML output -------------------------------------------------\n\n# The theme to use for HTML and HTML Help pages.  See the documentation for\n# a list of builtin themes.\n#\n\n# Theme options are theme-specific and customize the look and feel of a theme\n# further.  For a list of options available for each theme, see the\n# documentation.\n#\n# html_theme_options = {}\n\n# Add any paths that contain custom static files (such as style sheets) here,\n# relative to this directory. They are copied after the builtin static files,\n# so a file named \"default.css\" will overwrite the builtin \"default.css\".\nhtml_static_path = ['_static']\n\n# Custom sidebar templates, must be a dictionary that maps document names\n# to template names.\n#\n# The default sidebars (for documents that don't match any pattern) are\n# defined by theme itself.  Builtin themes are using these templates by\n# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',\n# 'searchbox.html']``.\n#\n# html_sidebars = {}\nhtml_theme = 'sphinx_rtd_theme'\n"
  },
  {
    "path": "docs/source/index.rst",
    "content": ".. nnieqat documentation master file, created by\n   sphinx-quickstart on Fri Aug 21 03:52:34 2020.\n   You can adapt this file completely to your liking, but it should at least\n   contain the root `toctree` directive.\n\nWelcome to nnieqat's documentation!\n===================================\n\n.. toctree::\n   :maxdepth: 2\n   :caption: Contents:\n\n\n\nIndices and tables\n==================\n\n* :ref:`genindex`\n* :ref:`modindex`\n* :ref:`search`\n"
  },
  {
    "path": "docs/source/modules.rst",
    "content": "nnieqat\n=======\n\n.. toctree::\n   :maxdepth: 4\n\n   nnieqat\n"
  },
  {
    "path": "docs/source/nnieqat.cuda10.rst",
    "content": "nnieqat.cuda10 package\n======================\n\nSubmodules\n----------\n\nnnieqat.cuda10.quantize module\n------------------------------\n\n.. automodule:: nnieqat.cuda10.quantize\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n\nModule contents\n---------------\n\n.. automodule:: nnieqat.cuda10\n   :members:\n   :undoc-members:\n   :show-inheritance:\n"
  },
  {
    "path": "docs/source/nnieqat.modules.rst",
    "content": "nnieqat.modules package\n=======================\n\nSubmodules\n----------\n\nnnieqat.modules.conv module\n---------------------------\n\n.. automodule:: nnieqat.modules.conv\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\nnnieqat.modules.linear module\n-----------------------------\n\n.. automodule:: nnieqat.modules.linear\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\nnnieqat.modules.pooling module\n------------------------------\n\n.. automodule:: nnieqat.modules.pooling\n   :members:\n   :undoc-members:\n   :show-inheritance:\n\n\nModule contents\n---------------\n\n.. automodule:: nnieqat.modules\n   :members:\n   :undoc-members:\n   :show-inheritance:\n"
  },
  {
    "path": "docs/source/nnieqat.rst",
    "content": "nnieqat package\n===============\n\nSubpackages\n-----------\n\n.. toctree::\n\n   nnieqat.cuda10\n   nnieqat.gpu\n   nnieqat.modules\n\nModule contents\n---------------\n\n.. automodule:: nnieqat\n   :members:\n   :undoc-members:\n   :show-inheritance:\n"
  },
  {
    "path": "docs/source/setup.rst",
    "content": "setup module\n============\n\n.. automodule:: setup\n   :members:\n   :undoc-members:\n   :show-inheritance:\n"
  },
  {
    "path": "nnieqat/__init__.py",
    "content": "\"\"\" quantize aware training package for  Neural Network Inference Engine(NNIE) on pytorch.\n\"\"\"\nimport sys\ntry:\n    from .quantize import quant_dequant_weight, unquant_weight, freeze_bn, \\\n        merge_freeze_bn, register_quantization_hook, test\nexcept:\n    raise\n__all__ = [\n    \"quant_dequant_weight\", \"unquant_weight\", \"freeze_bn\", \"merge_freeze_bn\", \\\n        \"register_quantization_hook\", \"test\"]\ntest()\n"
  },
  {
    "path": "nnieqat/cuda10/LICENSE.txt",
    "content": "/*\n * Copyright (c) 2018, Hisilicon Limited\n * All rights reserved.\n *\n * Redistribution and use in source and binary forms, with or without\n * modification, are permitted provided that the following conditions are met:\n *\n * 1. Redistributions of source code must retain the above copyright notice,\n * this list of conditions and the following disclaimer.\n *\n * 2. Redistributions in binary form must reproduce the above copyright notice,\n * this list of conditions and the following disclaimer in the documentation\n * and/or other materials provided with the distribution.\n *\n * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\n * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\n * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE\n * ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE\n * LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR\n * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF\n * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS\n * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN\n * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)\n * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE\n * POSSIBILITY OF SUCH DAMAGE.\n */\n"
  },
  {
    "path": "nnieqat/quantize.py",
    "content": "#!/usr/bin/env python\n\"\"\"Quantize function.\n\"\"\"\n\nimport ctypes\nimport datetime\nimport logging\nfrom os.path import abspath, dirname\nimport torch\nimport numpy as np\nfrom numba import cuda\nfrom quant_impl import fake_quantize\n\n_USE_GFPQ_QUANT_LIB = (torch.cuda.device_count() <= 1)\n\n\nclass GFPQParamSt(ctypes.Structure):\n    r\"\"\"GFPQ param, corresponds with struct GFPQ_PARAM_ST in gfpq.hpp\"\"\"\n    _fields_ = [(\"mode\", ctypes.c_int), (\"param\", ctypes.c_byte * 16)]\n\n\nclass _types:\n    r\"\"\"Some alias types.\"\"\"\n    handle = ctypes.c_void_p\n    stream = ctypes.c_void_p\n\n\nclass QuantAndDeQuantGPU():\n    r\"\"\"quantize and dequantize data with GFPG library.\n    \"\"\"\n    def __init__(self,\n                 libquant_path=dirname(abspath(__file__)) +\n                 \"/gpu/lib/libgfpq_gpu.so\",\n                 libcublas_path=\"libcublas.so\",\n                 bit_width=8,\n                 param_mode=0):\n        global _USE_GFPQ_QUANT_LIB\n        self._bit_width = bit_width\n        if _USE_GFPQ_QUANT_LIB:\n            self._libquant = ctypes.cdll.LoadLibrary(libquant_path)\n            self._libcublas = ctypes.cdll.LoadLibrary(libcublas_path)\n            self._libcublas.cublasCreate_v2.restype = int\n            self._libcublas.cublasCreate_v2.argtypes = [ctypes.c_void_p]\n            self._cublas_handle = _types.handle()\n            self._libcublas.cublasCreate_v2(ctypes.byref(self._cublas_handle))\n            self._param = GFPQParamSt()\n            self._stream = cuda.stream()\n            self._param.mode = param_mode\n\n    def __call__(self, tensor, mode=0):\n        r\"\"\" Converts float weights to quantized weights.\n\n        Args:\n            - tensor: input data\n            - mode: GFPQ mode for param\n                GFPQ_MODE_INIT(0): There is no valid parameter in param[].\n                    Generate the parameter and filled in param[].\n                GFPQ_MODE_UPDATE(1): There is parameter in param[]. Generate\n                    new parameter, update param[] when the new parameter is\n                    better.\n                GFPQ_MODE_APPLY_ONLY(2): There is parameter in param[]. Don't\n                    generate parameter. 
Just use the param[].\n        \"\"\"\n\n        global _USE_GFPQ_QUANT_LIB\n        if _USE_GFPQ_QUANT_LIB:\n            try:\n                if isinstance(tensor, tuple):\n                    for tensor_item in tensor:\n                        data_cuda_array = cuda.as_cuda_array(\n                            tensor_item.data.detach())\n                        data_p = data_cuda_array.device_ctypes_pointer\n                        self._param.mode = mode\n                        ret = self._libquant.HI_GFPQ_QuantAndDeQuant_GPU_PY(\n                            data_p, data_cuda_array.size, self._bit_width,\n                            ctypes.byref(self._param), self._stream.handle,\n                            self._cublas_handle)\n                else:\n                    data_cuda_array = cuda.as_cuda_array(tensor.data.detach())\n                    data_p = data_cuda_array.device_ctypes_pointer\n                    self._param.mode = mode\n                    ret = self._libquant.HI_GFPQ_QuantAndDeQuant_GPU_PY(\n                        data_p, data_cuda_array.size, self._bit_width,\n                        ctypes.byref(self._param), self._stream.handle,\n                        self._cublas_handle)\n            except:\n                pass\n            finally:\n                if ret != 0:\n                    _USE_GFPQ_QUANT_LIB = False\n                    logger = logging.getLogger(__name__)\n                    logger.setLevel(logging.WARNING)\n                    logger.warning(\n                        \"\"\"Failed to quantize data with default HiSVP GFPQ library,\n                        Use implemented quantization algorithm instead.\"\"\")\n                    if isinstance(tensor, tuple):\n                        for tensor_item in tensor:\n                            tensor_item.data = fake_quantize(\n                                tensor_item.data.detach().clone(), self._bit_width)\n                    else:\n                        tensor.data = fake_quantize(tensor.data.detach().clone(),\n                                                    self._bit_width)\n        else:\n            if isinstance(tensor, tuple):\n                for tensor_item in tensor:\n                    tensor_item.data = fake_quantize(tensor_item.data.detach().clone(),\n                                                     self._bit_width)\n            else:\n                tensor.data = fake_quantize(tensor.data.detach().clone(),\n                                            self._bit_width)\n        return tensor\n\n\n_QUANT_HANDLE = QuantAndDeQuantGPU()\n\n\ndef _fuse_conv_bn_weights(conv_w, conv_b, bn_rm, bn_rv, bn_eps, bn_w, bn_b):\n    \"\"\" fuse convolution and batch norm's weight.\n\n    Args:\n        conv_w (torch.nn.Parameter): convolution weight.\n        conv_b (torch.nn.Parameter): convolution bias.\n        bn_rm (torch.nn.Parameter): batch norm running mean.\n        bn_rv (torch.nn.Parameter): batch norm running variance.\n        bn_eps (torch.nn.Parameter): batch norm epsilon.\n        bn_w (torch.nn.Parameter): batch norm weight.\n        bn_b (torch.nn.Parameter): batch norm weight.\n\n    Returns:\n        conv_w(torch.nn.Parameter): fused convolution weight.\n        conv_b(torch.nn.Parameter): fused convllution bias.\n    \"\"\"\n\n    if conv_b is None:\n        conv_b = bn_rm.new_zeros(bn_rm.shape)\n    bn_var_rsqrt = torch.rsqrt(bn_rv + bn_eps)\n\n    conv_w = conv_w * \\\n        (bn_w * bn_var_rsqrt).reshape([-1] + [1] * (len(conv_w.shape) - 1))\n    conv_b = 
(conv_b - bn_rm) * bn_var_rsqrt * bn_w + bn_b\n\n    return torch.nn.Parameter(conv_w), torch.nn.Parameter(conv_b)\n\n\ndef _fuse_conv_bn(conv, bn):\n    conv.weight, conv.bias = \\\n        _fuse_conv_bn_weights(conv.weight, conv.bias,\n                             bn.running_mean, bn.running_var, bn.eps, bn.weight, bn.bias)\n    return conv\n\n\ndef _fuse_modules(model):\n    r\"\"\"Fuses a list of modules into a single module\n\n    Fuses only the following sequence of modules:\n    conv, bn\n    All other sequences are left unchanged.\n    For these sequences, fuse modules on weight level, keep model structure unchanged.\n\n    Arguments:\n        model: Model containing the modules to be fused\n\n    Returns:\n        model with fused modules.\n\n    \"\"\"\n    children = list(model.named_children())\n    conv_module = None\n    conv_name = None\n\n    for name, child in children:\n        if isinstance(child, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d,\n                              torch.nn.BatchNorm3d)):\n            if isinstance(conv_module, (torch.nn.Conv2d, torch.nn.Conv3d)):\n                conv_module = _fuse_conv_bn(conv_module, child)\n                model._modules[conv_name] = conv_module\n                child.eval()\n                child.running_mean = child.running_mean.new_full(\n                    child.running_mean.shape, 0)\n                child.running_var = child.running_var.new_full(\n                    child.running_var.shape, 1)\n                if child.weight is not None:\n                    child.weight.data = child.weight.data.new_full(\n                        child.weight.shape, 1)\n                if child.bias is not None:\n                    child.bias.data = child.bias.data.new_full(\n                        child.bias.shape, 0)\n                child.track_running_stats = False\n                child.momentum = 0\n                child.eps = 0\n            conv_module = None\n        elif isinstance(child, (torch.nn.Conv2d, torch.nn.Conv3d)):\n            conv_module = child\n            conv_name = name\n        else:\n            _fuse_modules(child)\n    return model\n\n\ndef freeze_bn(m, freeze_bn_affine=True):\n    \"\"\"Freeze batch normalization.\n        reference: https://arxiv.org/abs/1806.08342\n\n\n    Args:\n        - m (nn.module): torch module\n        - freeze_bn_affine (bool, optional): Freeze affine scale and\n        translation factor or not. 
Defaults: True.\n    \"\"\"\n\n    if isinstance(\n            m,\n        (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)):\n\n        m.eval()\n        if freeze_bn_affine:\n            m.weight.requires_grad = False\n            m.bias.requires_grad = False\n\n\ndef merge_freeze_bn(model):\n    \"\"\"merge batch norm's weight into convolution, then freeze it.\n\n    Args:\n        model (nn.module): model.\n\n    Returns:\n        [nn.module]: model.\n    \"\"\"\n    model = _fuse_modules(model)\n    model.apply(freeze_bn)\n    return model\n\n\ndef unquant_weight(m):\n    \"\"\" unquantize weight before update weight, avoid training turbulence.\n\n    Args:\n        - m (nn.module): torch module.\n    \"\"\"\n    try:\n        if hasattr(m, \"weight_origin\") and m.weight is not None:\n            m.weight.data.copy_(m.weight_origin.data)\n    except AttributeError:\n        pass\n    except TypeError:\n        pass\n\n\ndef quant_dequant_weight(m):\n    \"\"\" quant weight manually.\n\n    Args:\n        - m (nn.module): torch module.\n    \"\"\"\n    global _QUANT_HANDLE\n    global _USE_GFPQ_QUANT_LIB\n    quant_handle = _QUANT_HANDLE\n    if not _USE_GFPQ_QUANT_LIB:\n        quant_handle = QuantAndDeQuantGPU()\n    try:\n        if hasattr(m, \"weight_origin\") and m.weight is not None:\n            m.weight_origin.data.copy_(m.weight.data)\n            m.weight.data = quant_handle(m.weight.data.detach().clone())\n    except AttributeError:\n        pass\n    except TypeError:\n        pass\n\n\ndef _quantizing_activation(module, input, output):\n    if isinstance(\n            module,\n        (torch.nn.ReLU, torch.nn.ELU, torch.nn.LeakyReLU, torch.nn.PReLU)):\n        global _QUANT_HANDLE\n        global _USE_GFPQ_QUANT_LIB\n        quant_handle = _QUANT_HANDLE\n        if not _USE_GFPQ_QUANT_LIB:\n            quant_handle = QuantAndDeQuantGPU()\n        # print(\"quantizing activation.\")\n        # print(output[0][0][0])\n        output_type = output.dtype\n        module.activation_max_value = torch.max(torch.max(torch.abs(output.detach())), module.activation_max_value.to(output_type))\n        # print(module.activation_max_value)\n        tensor_t = torch.cat((output, torch.ones(output[0].shape).cuda().unsqueeze(0) * module.activation_max_value))\n        output.data = quant_handle(tensor_t.float())[:-1]\n        output = output.to(output_type)\n        # print(output[0][0][0])\n\n\ndef _quantizing_data(module, input):\n    global _QUANT_HANDLE\n    global _USE_GFPQ_QUANT_LIB\n    quant_handle = _QUANT_HANDLE\n    if not _USE_GFPQ_QUANT_LIB:\n        quant_handle = QuantAndDeQuantGPU()\n    # print(\"quantizing data.\")\n    # print(input[0][0][0])\n    # print(\"quantizing data.\")\n    # print(input[0][0][0])\n    # input_type = input.dtype\n    if isinstance(input, tuple):\n        for item in input:\n            item_type = item.dtype\n            item = quant_handle(item.float())\n            item.to(item_type)\n    else:\n        input = quant_handle(input.float())\n    # input = input.to(input_type)\n    # print(input[0][0][0])\n\n\ndef _quantizing_weight(module, input):\n    global _QUANT_HANDLE\n    global _USE_GFPQ_QUANT_LIB\n    quant_handle = _QUANT_HANDLE\n    if not _USE_GFPQ_QUANT_LIB:\n        quant_handle = QuantAndDeQuantGPU()\n    # print(\"quantizing weight.\")\n    # print(module.weight[0][0][0])\n    module.weight_origin.data.copy_(module.weight.data)\n    module.weight.data = quant_handle(module.weight.data.detach().clone())\n    # 
print(module.weight[0][0][0])\n\n\ndef register_quantization_hook(model,\n                               quant_weight=True,\n                               quant_activation=True,\n                               quant_data=False):\n    \"\"\"register quantization hook for model.\n\n    Args:\n        model (:class:`Module`): Module.\n\n    Returns:\n        Module: self\n    \"\"\"\n\n    #  weight quantizing.\n    logger = logging.getLogger(__name__)\n    logger.setLevel(logging.INFO)\n\n    for _, module in model._modules.items():\n        if len(list(module.children())) > 0:\n            register_quantization_hook(module, quant_weight, quant_activation)\n        else:\n            if quant_weight and hasattr(\n                    module,\n                    \"weight\") and module.weight is not None and not isinstance(\n                        module, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d,\n                                 torch.nn.BatchNorm3d)):\n                module.register_buffer('weight_origin', module.weight.detach().clone())\n                if quant_data:\n                    module.register_forward_pre_hook(_quantizing_data)\n                    logger.info(\"Quantizing input data of %s\", str(module))\n                module.register_forward_pre_hook(_quantizing_weight)\n                logger.info(\"Quantizing weight of %s\", str(module))\n\n            if quant_activation and isinstance(\n                    module, (torch.nn.ReLU, torch.nn.ELU, torch.nn.LeakyReLU, torch.nn.PReLU)):\n                module.register_buffer(\"activation_max_value\", torch.tensor(0, dtype=torch.float).cuda())\n                module.register_forward_hook(_quantizing_activation)\n                logger.info(\"Quantizing activation of %s\", str(module))\n\n    return model\n\n\ndef test():\n    r\"\"\" Test GFPG library QuantAndDeQuantGPU.\n    \"\"\"\n    quant_handle = QuantAndDeQuantGPU()\n    logger = logging.getLogger(__name__)\n    logger.setLevel(logging.INFO)\n    tensor = torch.Tensor(np.array([-9, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9])).cuda()\n    logging.info(\"Origin Data: \")\n    logging.info(tensor)\n\n    start_time = datetime.datetime.now()\n    quant_tensor = quant_handle(tensor)\n    end_time = datetime.datetime.now()\n\n    logging.info(\"Quant Data: \")\n    logging.info(quant_tensor)\n\n    data_expected = np.array([\n        -8.7240619659, 0.0000000000, 1.0000000000, 2.0000000000, 2.9536523819,\n        4.0000000000, 4.9674310684, 5.9073047638, 7.0250086784, 8.0000000000,\n        8.7240619659\n    ])\n\n    logging.info(\"Data expected:  \")\n    logging.info(\" \".join([str(v) for v in data_expected]))\n\n    data_diff = quant_tensor.data.detach().cpu().numpy() - data_expected\n    flag = \"success.\"\n    for num in data_diff:\n        if abs(num) > 0.000000001:\n            flag = \"failed.\"\n\n    run_time = end_time - start_time\n    logging.info(\"QuantAndDeQuantGPU time: %s\", str(run_time))\n    logging.info(\"QuantAndDeQuantGPU %s\", flag)\n"
  },
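  {
    "path": "examples/fuse_bn_sketch.py",
    "content": "\"\"\"Standalone numerical check of the conv+BN folding identity used by\nnnieqat.quantize._fuse_conv_bn_weights.\n\nIn eval mode BN computes y = (x - mean) / sqrt(var + eps) * gamma + beta, so\nfolding it into the preceding convolution amounts to, per output channel,\n\n    W' = W * gamma / sqrt(var + eps)\n    b' = (b - mean) * gamma / sqrt(var + eps) + beta\n\nThe layer sizes and statistics below are arbitrary illustrative choices; this\nsketch is not part of the package's test suite.\n\"\"\"\nimport torch\n\ntorch.manual_seed(0)\nconv = torch.nn.Conv2d(3, 4, 3, bias=True)\nbn = torch.nn.BatchNorm2d(4).eval()\n# Give BN non-trivial statistics so the check is meaningful.\nbn.running_mean.uniform_(-1, 1)\nbn.running_var.uniform_(0.5, 2.0)\nbn.weight.data.uniform_(0.5, 1.5)\nbn.bias.data.uniform_(-1, 1)\n\nx = torch.randn(2, 3, 8, 8)\nreference = bn(conv(x))\n\n# Fold BN into a fresh convolution using the identity above.\nscale = bn.weight.data * torch.rsqrt(bn.running_var + bn.eps)\nfused = torch.nn.Conv2d(3, 4, 3, bias=True)\nfused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)\nfused.bias.data = (conv.bias.data - bn.running_mean) * scale + bn.bias.data\n\n# Should print a value on the order of float32 rounding error (~1e-6).\nprint(\"max abs diff:\", (fused(x) - reference).abs().max().item())\n"
  },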
  {
    "path": "pyproject.toml",
    "content": "[build-system]\nrequires = [\"setuptools>=40.8.0\", \"wheel\"]\nbuild-backend = \"setuptools.build_meta\"\n"
  },
  {
    "path": "setup.cfg",
    "content": "[metadata]\nlicense_files = LICENSE.txt\n"
  },
  {
    "path": "setup.py",
    "content": "from setuptools import setup, find_packages\nimport pathlib\nfrom torch.utils.cpp_extension import BuildExtension, CUDAExtension\n\nfrom build_helper import check_cuda_version\nassert(check_cuda_version())\n\nimport os\nos.system('make -j%d' % os.cpu_count())\n\nhere = pathlib.Path(__file__).parent.resolve()\nlong_description = (here / 'README.md').read_text(encoding='utf-8')\n\nsetup(\n    name='nnieqat',\n    version='0.1.0',\n    description='A nnie quantization aware training tool on pytorch.',\n    long_description=long_description,\n    long_description_content_type='text/markdown',\n    url='https://github.com/aovoc/nnieqat-pytorch',\n    author='Minqin Chen',\n    author_email='minqinchen@deepglint.com',\n    license='MIT',\n    classifiers=[\n        'Development Status :: 5 - Production/Stable',\n        \"Intended Audience :: Science/Research\",\n        'Intended Audience :: Developers',\n        \"Topic :: Scientific/Engineering :: Artificial Intelligence\",\n        \"Topic :: Software Development :: Libraries :: Python Modules\",\n        'License :: OSI Approved :: MIT License',\n        'Programming Language :: Python :: 3',\n        'Programming Language :: Python :: 3.5',\n        'Programming Language :: Python :: 3.6',\n        'Programming Language :: Python :: 3.7',\n        'Programming Language :: Python :: 3.8',\n        'Programming Language :: Python :: 3 :: Only',\n    ],\n    keywords=[\n        \"quantization aware training\",\n        \"deep learning\",\n        \"neural network\",\n        \"CNN\",\n        \"machine learning\",\n    ],\n    packages=find_packages(),\n    package_data={\n        \"nnieqat\": [\"gpu/lib/*gfpq*\"],\n    },\n    python_requires='>=3.5, <4',\n    install_requires=[\n        \"torch>=1.5\",\n        \"numba>=0.42.0\",\n        \"numpy>=1.18.1\"\n    ],\n    extras_require={\n        'test': [\"torchvision>=0.4\",\n                 \"nose\",\n                 \"ddt\"\n                 ],\n        'docs': [\n            'sphinx==2.4.4',\n            'sphinx_rtd_theme'\n        ]\n    },\n    ext_modules=[\n        CUDAExtension(\n            name=\"quant_impl\",\n            sources=[\n                \"./src/fake_quantize.cpp\",\n            ],\n            libraries=['quant_impl'],\n            library_dirs=['obj'],\n        )\n    ],\n    cmdclass={'build_ext': BuildExtension},\n    test_suite=\"nnieqat.test.test_cifar10\",\n)\n"
  },
  {
    "path": "src/fake_quantize.cpp",
    "content": "#include \"fake_quantize.h\"\n\n#define CHECK_CUDA(x) TORCH_CHECK(x.is_cuda(), #x \" must be a CUDA tensor\")\n#define CHECK_CONTIGUOUS(x) TORCH_CHECK(x.is_contiguous(), #x \" must be contiguous\")\n#define CHECK_INPUT(x) CHECK_CUDA(x); CHECK_CONTIGUOUS(x)\n\nTensor fake_quantize(Tensor a, int bit_width){\n  CHECK_INPUT(a);\n  return fake_quantize_cuda(a, bit_width);\n}\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m){\n  m.def(\"fake_quantize\", &fake_quantize, \"NNIE Fake Quantization (CUDA)\");\n}"
  },
  {
    "path": "src/fake_quantize.cu",
    "content": "#include \"fake_quantize.h\"\n__global__ void fake_quantize_kernel_cuda(float* __restrict__ a,\n                                            float* o, int size,\n                                            float* max_entry,\n                                            int bit_width) {\n    if(bit_width!=8) bit_width =16;\n    int index = blockIdx.x * blockDim.x + threadIdx.x;\n    \n    if (index < size) {\n        if((*max_entry) < 1e-15 && (*max_entry) > -1e-15){\n            o[index] = 0;\n            return;\n        }\n\n        if(bit_width == 8){\n            float data_max = (*max_entry);\n            int max_entry_qdata_int =  floorf(__log2f(data_max) * 16) + 1;\n            data_max = __powf(2, __fdividef(max_entry_qdata_int, 16));\n            float data_max_floor = __powf(2, __fdividef(max_entry_qdata_int-1, 16));\n\n            if(a[index] <= data_max_floor * 0.0020395972313035  // exp(ln(256) / 128) / 512= 2^(1/16-9) = 1.0442737824274 /512 = 0.0020395972313035\n                && a[index] > - data_max * 0.0020395972313035){  \n                o[index] = 0;\n                return;\n            }\n\n            //int qdata_int = (int)(log(256 * a[index] / data_max ) / 0.04332169878499658);  //ln(256) / 128 =  0.04332169878499658\n            int qdata_int = 0;\n            if(a[index] > 0){\n                qdata_int = rintf(__fdividef(  __logf(__fdividef(256* a[index],data_max)), 0.04332169878499658));  //ln(256) / 128 =  0.04332169878\n                if(qdata_int > 127) qdata_int = 127;\n                else if(qdata_int < 0) qdata_int = 0;   \n                o[index] =  __fdividef(data_max , 256.0) *  __expf(qdata_int*0.04332169878499658);   \n            }\n            else{\n                qdata_int = - rintf(__fdividef(  __logf(__fdividef(- 256* a[index], data_max)), 0.04332169878499658));  //ln(256) / 128 =  0.04332169878\n                if(qdata_int < -127) qdata_int = -127;\n                else if(qdata_int >-1) qdata_int = -1;\n                o[index] = - __fdividef(data_max , 256.0) * __expf(- qdata_int*0.04332169878499658);\n            }\n\n        }\n        else{\n            float data_max = (*max_entry);\n            int max_entry_qdata_int =  floorf(__log2f(data_max) * 128) + 1;\n            data_max = __powf(2, __fdividef(max_entry_qdata_int, 128));\n            float data_max_floor = __powf(2, __fdividef(max_entry_qdata_int-1, 16));\n\n            \n            if(a[index] < data_max_floor *0.0019537861485404  //exp(ln(2^16)/(2^15)) / 512 = 0.0019537861485404\n                && a[index] > - data_max * 0.0019537861485404){ \n                o[index] = 0;\n                return;\n            }\n\n            int qdata_int = 0;\n            if(a[index] > 0){\n                qdata_int = rintf(__fdividef(  __logf(__fdividef(65536* a[index], data_max)), 0.00033845077175779)); \n                if(qdata_int > 32767) qdata_int = 32767;\n                else if(qdata_int <0) qdata_int = 0;\n                o[index] =  __fdividef(data_max , 65536.0) * __expf(qdata_int * 0.00033845077175779); \n            }\n            else{\n                qdata_int = - rintf(__fdividef(  __logf(__fdividef(- 65536* a[index], data_max)), 0.00033845077175779));\n                if(qdata_int < -32767) qdata_int = -32767;\n                else if(qdata_int >-1) qdata_int = -1;\n                o[index] = - __fdividef(data_max , 65536.0) * __expf(- qdata_int * 0.00033845077175779);  \n            }\n        }\n\n    }\n}\n\n\nTensor fake_quantize_cuda(Tensor a, 
int bit_width) {\n    auto o = at::zeros_like(a);\n    int64_t size = a.numel();\n  \n    Tensor max_entry = at::max(at::abs(a));\n    int blockSize = 1024;\n    int blockNums = (size + blockSize - 1) / blockSize;\n  \n    fake_quantize_kernel_cuda<<<blockNums, blockSize>>>(a.data_ptr<float>(),\n                                                        o.data_ptr<float>(),\n                                                        size,\n                                                        max_entry.data_ptr<float>(),\n                                                        bit_width);\n    return o;\n  }\n\n"
  },
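  {
    "path": "src/test/reference_quantize.py",
    "content": "\"\"\"NumPy reference for the 8-bit branch of fake_quantize_kernel_cuda in\nsrc/fake_quantize.cu.\n\nThis mirrors the kernel's arithmetic on the CPU so the logarithmic grid is\neasy to inspect; the constants are copied from the kernel. It is a sketch for\nstudy, not part of the build or the test suite.\n\"\"\"\nimport numpy as np\n\nLOG_STEP = np.log(256.0) / 128.0        # spacing of the 8-bit log grid\nDEADZONE = 2.0 ** (1.0 / 16.0) / 512.0  # values this small snap to zero\n\n\ndef fake_quantize_8bit(a):\n    a = np.asarray(a, dtype=np.float64)\n    data_max = np.abs(a).max()\n    if data_max < 1e-15:\n        return np.zeros_like(a)\n    # Snap the observed maximum up to the next point of the 2^(k/16) grid.\n    k = np.floor(np.log2(data_max) * 16) + 1\n    data_max = 2.0 ** (k / 16.0)\n    data_max_floor = 2.0 ** ((k - 1) / 16.0)\n\n    out = np.zeros_like(a)\n    for i, v in enumerate(a.flat):\n        if v <= data_max_floor * DEADZONE and v > -data_max * DEADZONE:\n            continue  # inside the dead zone around zero\n        if v > 0:\n            q = np.clip(np.rint(np.log(256.0 * v / data_max) / LOG_STEP), 0, 127)\n            out.flat[i] = data_max / 256.0 * np.exp(q * LOG_STEP)\n        else:\n            q = np.clip(-np.rint(np.log(-256.0 * v / data_max) / LOG_STEP), -127, -1)\n            out.flat[i] = -data_max / 256.0 * np.exp(-q * LOG_STEP)\n    return out\n\n\nif __name__ == \"__main__\":\n    # Same input as nnieqat.quantize.test(); expect e.g. 9 -> ~8.7240619659.\n    data = [-9.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]\n    print(fake_quantize_8bit(data))\n"
  },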
  {
    "path": "src/fake_quantize.h",
    "content": "#include <cstdlib>\n#include <math.h>\n#include <cuda.h>\n#include <cuda_runtime.h>\n#include <climits>\n#include <stdint.h>\n#include <tuple>\n#include <ATen/ATen.h>\n#include <torch/torch.h>\n\nusing namespace at;\n\nTensor fake_quantize(Tensor a, int bit_width=8);\n\nTensor fake_quantize_cuda(Tensor a, int bit_width=8);\n\n__global__ void fake_quantize_kernel_cuda(float* __restrict__ a,\n                                            float* o, int size,\n                                            float* max_entry,\n                                            int bit_width=8);\n"
  },
  {
    "path": "src/test/Makefile",
    "content": "# Uncomment for debugging\nDEBUG := 1\n# Pretty build\n# Q ?= @\n\nCXX := g++\npython := python3\nPYTHON_HEADER_DIR := $(shell python -c 'from distutils.sysconfig import get_python_inc; print(get_python_inc())')\nPYTORCH_INCLUDES := $(shell python -c 'from torch.utils.cpp_extension import include_paths; [print(p) for p in include_paths()]')\nPYTORCH_LIBRARIES := $(shell python -c 'from torch.utils.cpp_extension import library_paths; [print(p) for p in library_paths()]')\n\nCUDA_DIR := $(shell python -c 'from torch.utils.cpp_extension import _find_cuda_home; print(_find_cuda_home())')\nWITH_ABI := $(shell python -c 'import torch; print(int(torch._C._GLIBCXX_USE_CXX11_ABI))')\nINCLUDE_DIRS := ./ $(CUDA_DIR)/include\nINCLUDE_DIRS += $(PYTHON_HEADER_DIR)\nINCLUDE_DIRS += $(PYTORCH_INCLUDES)\n\n# Custom (MKL/ATLAS/OpenBLAS) include and lib directories.\n# BLAS_INCLUDE := /path/to/your/blas\n# BLAS_LIB := /path/to/your/blas\n\nSRC_DIR := ./\nOBJ_DIR := ./obj\nCPP_SRCS := $(wildcard $(SRC_DIR)/*.cpp)\nCU_SRCS := $(wildcard $(SRC_DIR)/*.cu)\nOBJS := $(patsubst $(SRC_DIR)/%.cpp,$(OBJ_DIR)/%.o,$(CPP_SRCS))\nCU_OBJS := $(patsubst $(SRC_DIR)/%.cu,$(OBJ_DIR)/cuda/%.o,$(CU_SRCS))\nSTATIC_LIB := $(OBJ_DIR)/libquant_impl.a\n\n\nCUDA_ARCH := -gencode arch=compute_50,code=sm_50 \\\n\t\t-gencode arch=compute_52,code=sm_52 \\\n\t\t-gencode arch=compute_60,code=sm_60 \\\n\t\t-gencode arch=compute_61,code=sm_61 \\\n\t\t-gencode arch=compute_70,code=sm_70 \\\n\t\t-gencode arch=compute_75,code=sm_75 \\\n\t\t-gencode arch=compute_75,code=compute_75\n\n\nLIBRARIES += stdc++ cudart c10 caffe2 torch torch_python caffe2_gpu\n\n\nifeq ($(DEBUG), 1)\n\tCOMMON_FLAGS += -DDEBUG -g -O0\n\tNVCCFLAGS += -g -G # -rdc true\nelse\n\tCOMMON_FLAGS += -DNDEBUG -O3\nendif\n\nWARNINGS := -Wall -Wno-sign-compare -Wcomment\nINCLUDE_DIRS += $(BLAS_INCLUDE)\nCXXFLAGS += -MMD -MP\nCOMMON_FLAGS += $(foreach includedir,$(INCLUDE_DIRS),-I$(includedir)) \\\n\t     -DTORCH_API_INCLUDE_EXTENSION_H -D_GLIBCXX_USE_CXX11_ABI=$(WITH_ABI)\nCXXFLAGS += -pthread -fPIC -fwrapv -std=c++14 $(COMMON_FLAGS) $(WARNINGS)\nNVCCFLAGS += -std=c++14 -ccbin=$(CXX) -Xcompiler -fPIC -use_fast_math $(COMMON_FLAGS)\n\ndefault: $(STATIC_LIB)\n\n$(OBJ_DIR):\n\t@ mkdir -p $@\n\t@ mkdir -p $@/cuda\n\n$(OBJ_DIR)/%.o: $(SRC_DIR)/%.cpp | $(OBJ_DIR)\n\t@ echo CXX $<\n\t$(Q)$(CXX) $< $(CXXFLAGS) -c -o $@\n\n$(OBJ_DIR)/cuda/%.o: $(SRC_DIR)/%.cu | $(OBJ_DIR)\n\t@ echo NVCC $<\n\t$(Q)nvcc $(NVCCFLAGS) $(CUDA_ARCH) -M $< -o ${@:.o=.d} \\\n\t\t-odir $(@D)\n\t$(Q)nvcc $(NVCCFLAGS) $(CUDA_ARCH) -c $< -o $@\n\n$(STATIC_LIB): $(OBJS) $(CU_OBJS) | $(OBJ_DIR)\n\t$(RM) -f $(STATIC_LIB)\n\t$(RM) -rf build dist\n\t@ echo LD -o $@\n\tar rc $(STATIC_LIB) $(OBJS) $(CU_OBJS)\n\nbuild:\n\t$(python) setup.py build\n\nupload:\n\t$(python) setup.py sdist bdist_wheel\n\t#twine upload dist/*\n\nclean:\n\t$(RM) -rf build dist nnieqat.egg-info obj\n\ntest:\n\tnosetests -s tests/test_quant_impl.py --nologcapture\n\nlint:\n\tpylint nnieqat --reports=n\n\nlintfull:\n\tpylint nnieqat\n\ninstall:\n\t$(python) setup.py install \n\nuninstall:\n\t$(python) setup.py install --record install.log\n\tcat install.log | xargs rm -rf \n\t$(RM) install.log\n"
  },
  {
    "path": "src/test/test.cu",
    "content": "#include <stdio.h>\n#include \"../fake_quantize.h\"\n\nint main(int argc, char *argv[])\n{\n\tTensor input = randn({2, 2});\n\tfake_quantize(input, 8);\n\treturn 0;\n}\n"
  },
  {
    "path": "tests/test_cifar10.py",
    "content": "# -*- coding:utf-8 -*-\nfrom nnieqat import quant_dequant_weight, unquant_weight, merge_freeze_bn, register_quantization_hook\nimport unittest\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nimport torch.optim as optim\nfrom torch.autograd import Variable\nimport torchvision\nimport torchvision.transforms as transforms\n\n\n\nclass Net(nn.Module):\n    def __init__(self):\n        super(Net, self).__init__()\n        self.conv1 = torch.nn.Conv2d(3, 6, 5)\n        self.pool = torch.nn.MaxPool2d(2, 2)\n        self.conv2 = torch.nn.Conv2d(6, 16, 5)\n        self.fc1 = torch.nn.Linear(16 * 5 * 5, 120)\n        self.fc2 = torch.nn.Linear(120, 84)\n        self.fc3 = torch.nn.Linear(84, 10)\n\n    def forward(self, x):\n        x = self.pool(F.relu(self.conv1(x)))\n        x = self.pool(F.relu(self.conv2(x)))\n        x = x.view(-1, 16 * 5 * 5)\n        x = F.relu(self.fc1(x))\n        x = F.relu(self.fc2(x))\n        x = self.fc3(x)\n        return x\n\nclass TestCifar10(unittest.TestCase):\n    def test(self):\n        transform = transforms.Compose([\n            transforms.ToTensor(),\n            transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))\n        ])\n        trainset = torchvision.datasets.CIFAR10(root='./data',\n                                                train=True,\n                                                download=True,\n                                                transform=transform)\n        trainloader = torch.utils.data.DataLoader(trainset,\n                                                  batch_size=4,\n                                                  shuffle=True,\n                                                  num_workers=2)\n        testset = torchvision.datasets.CIFAR10(root='./data',\n                                               train=False,\n                                               download=True,\n                                               transform=transform)\n        testloader = torch.utils.data.DataLoader(testset,\n                                                 batch_size=4,\n                                                 shuffle=True,\n                                                 num_workers=2)\n\n        dataiter = iter(trainloader)\n        images, labels = dataiter.next()\n        net = Net()\n        register_quantization_hook(net)\n        net.cuda()\n        criterion = nn.CrossEntropyLoss()\n        optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)\n\n\n        print(\"Cifar10 training:\")\n        for epoch in range(5):\n            net.train()\n            if epoch > 2:\n                net = merge_freeze_bn(net)\n            running_loss = 0.0\n            for i, data in enumerate(trainloader, 0):\n                inputs, labels = data\n                inputs, labels = Variable(inputs.cuda()), Variable(\n                    labels.cuda())\n                optimizer.zero_grad()\n                outputs = net(inputs)\n                loss = criterion(outputs, labels)\n                loss.backward()\n                net.apply(unquant_weight)\n                optimizer.step()\n\n                running_loss += loss.item()\n                if i % 2000 == 1999:\n                    print(' epoch %3d, Iter %5d, loss: %.3f' %\n                                (epoch + 1, i + 1, running_loss / 2000))\n                    running_loss = 0.0\n        print('Finished Training.')\n\n        # net.apply(quant_dequant_weight)\n        correct = total = 0\n        for 
data in testloader:\n            images, labels = data\n            outputs = net(Variable(images.cuda()))\n            _, predicted = torch.max(outputs.data, 1)\n            correct += (predicted == labels.cuda()).sum()\n            total += labels.size(0)\n        print(\n            'Accuracy(10000 test images, modules\\' weight unquantize): %d %%' %\n            (100.0 * correct / total))\n\n\nif __name__ == \"__main__\":\n    suite = unittest.TestSuite()\n    suite.addTest(TestCifar10(\"test\"))\n    runner = unittest.TextTestRunner()\n    runner.run(suite)\n"
  },
  {
    "path": "tests/test_imagenet.py",
    "content": "import argparse\nimport os\nimport random\nimport shutil\nimport time\nimport warnings\n\nfrom nnieqat import quant_dequant_weight, unquant_weight, merge_freeze_bn, register_quantization_hook\nimport torch\nimport torch.nn as nn\nimport torch.nn.parallel\nimport torch.backends.cudnn as cudnn\nimport torch.distributed as dist\nimport torch.optim\nimport torch.multiprocessing as mp\nimport torch.utils.data\nimport torch.utils.data.distributed\nimport torchvision.transforms as transforms\nimport torchvision.datasets as datasets\nimport torchvision.models as models\n\nmodel_names = sorted(name for name in models.__dict__\n    if name.islower() and not name.startswith(\"__\")\n    and callable(models.__dict__[name]))\n\nparser = argparse.ArgumentParser(description='PyTorch ImageNet Training')\nparser.add_argument('data', metavar='DIR',\n                    help='path to dataset')\nparser.add_argument('-a', '--arch', metavar='ARCH', default='squeezenet1_1',\n                    choices=model_names,\n                    help='model architecture: ' +\n                        ' | '.join(model_names) +\n                        ' (default: resnet18)')\nparser.add_argument('-j', '--workers', default=32, type=int, metavar='N',\n                    help='number of data loading workers (default: 4)')\nparser.add_argument('--epochs', default=120, type=int, metavar='N',\n                    help='number of total epochs to run')\nparser.add_argument('--start-epoch', default=0, type=int, metavar='N',\n                    help='manual epoch number (useful on restarts)')\nparser.add_argument('-b', '--batch-size', default=256, type=int,\n                    metavar='N',\n                    help='mini-batch size (default: 256), this is the total '\n                         'batch size of all GPUs on the current node when '\n                         'using Data Parallel or Distributed Data Parallel')\nparser.add_argument('--lr', '--learning-rate', default=0.001, type=float,\n                    metavar='LR', help='initial learning rate', dest='lr')\nparser.add_argument('--momentum', default=0.9, type=float, metavar='M',\n                    help='momentum')\nparser.add_argument('--wd', '--weight-decay', default=1e-4, type=float,\n                    metavar='W', help='weight decay (default: 1e-4)',\n                    dest='weight_decay')\nparser.add_argument('-p', '--print-freq', default=10, type=int,\n                    metavar='N', help='print frequency (default: 10)')\nparser.add_argument('--resume', default='', type=str, metavar='PATH',\n                    help='path to latest checkpoint (default: none)')\nparser.add_argument('-e', '--evaluate', dest='evaluate', action='store_true',\n                    help='evaluate model on validation set')\nparser.add_argument('--pretrained', dest='pretrained', action='store_true',\n                    help='use pre-trained model')\nparser.add_argument('--world-size', default=-1, type=int,\n                    help='number of nodes for distributed training')\nparser.add_argument('--rank', default=-1, type=int,\n                    help='node rank for distributed training')\nparser.add_argument('--dist-url', default='tcp://224.66.41.62:23456', type=str,\n                    help='url used to set up distributed training')\nparser.add_argument('--dist-backend', default='nccl', type=str,\n                    help='distributed backend')\nparser.add_argument('--seed', default=None, type=int,\n                    help='seed for initializing training. 
')\nparser.add_argument('--gpu', default=None, type=int,\n                    help='GPU id to use.')\nparser.add_argument('--multiprocessing-distributed', action='store_true',\n                    help='Use multi-processing distributed training to launch '\n                         'N processes per node, which has N GPUs. This is the '\n                         'fastest way to use PyTorch for either single node or '\n                         'multi node data parallel training')\n\nbest_acc1 = 0\n\n\ndef main():\n    args = parser.parse_args()\n\n    if args.seed is not None:\n        random.seed(args.seed)\n        torch.manual_seed(args.seed)\n        cudnn.deterministic = True\n        warnings.warn('You have chosen to seed training. '\n                      'This will turn on the CUDNN deterministic setting, '\n                      'which can slow down your training considerably! '\n                      'You may see unexpected behavior when restarting '\n                      'from checkpoints.')\n\n    if args.gpu is not None:\n        warnings.warn('You have chosen a specific GPU. This will completely '\n                      'disable data parallelism.')\n\n    if args.dist_url == \"env://\" and args.world_size == -1:\n        args.world_size = int(os.environ[\"WORLD_SIZE\"])\n\n    args.distributed = args.world_size > 1 or args.multiprocessing_distributed\n\n    ngpus_per_node = torch.cuda.device_count()\n    if args.multiprocessing_distributed:\n        # Since we have ngpus_per_node processes per node, the total world_size\n        # needs to be adjusted accordingly\n        args.world_size = ngpus_per_node * args.world_size\n        # Use torch.multiprocessing.spawn to launch distributed processes: the\n        # main_worker process function\n        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))\n    else:\n        # Simply call main_worker function\n        main_worker(args.gpu, ngpus_per_node, args)\n\n\ndef main_worker(gpu, ngpus_per_node, args):\n    global best_acc1\n    args.gpu = gpu\n\n    if args.gpu is not None:\n        print(\"Use GPU: {} for training\".format(args.gpu))\n\n    if args.distributed:\n        if args.dist_url == \"env://\" and args.rank == -1:\n            args.rank = int(os.environ[\"RANK\"])\n        if args.multiprocessing_distributed:\n            # For multiprocessing distributed training, rank needs to be the\n            # global rank among all the processes\n            args.rank = args.rank * ngpus_per_node + gpu\n        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,\n                                world_size=args.world_size, rank=args.rank)\n    # create model\n    if args.pretrained:\n        print(\"=> using pre-trained model '{}'\".format(args.arch))\n        model = models.__dict__[args.arch](pretrained=True)\n    else:\n        print(\"=> creating model '{}'\".format(args.arch))\n        model = models.__dict__[args.arch]()\n\n    register_quantization_hook(model)\n\n    if not torch.cuda.is_available():\n        print('using CPU, this will be slow')\n    elif args.distributed:\n        # For multiprocessing distributed, DistributedDataParallel constructor\n        # should always set the single device scope, otherwise,\n        # DistributedDataParallel will use all available devices.\n        if args.gpu is not None:\n            torch.cuda.set_device(args.gpu)\n            model.cuda(args.gpu)\n            # When using a single GPU per process and per\n            # 
DistributedDataParallel, we need to divide the batch size\n            # ourselves based on the total number of GPUs we have\n            args.batch_size = int(args.batch_size / ngpus_per_node)\n            args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)\n            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])\n        else:\n            model.cuda()\n            # DistributedDataParallel will divide and allocate batch_size to all\n            # available GPUs if device_ids are not set\n            model = torch.nn.parallel.DistributedDataParallel(model)\n    elif args.gpu is not None:\n        torch.cuda.set_device(args.gpu)\n        model = model.cuda(args.gpu)\n    else:\n        # DataParallel will divide and allocate batch_size to all available GPUs\n        if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):\n            model.features = torch.nn.DataParallel(model.features)\n            model.cuda()\n        else:\n            model = torch.nn.DataParallel(model).cuda()\n\n    # define loss function (criterion) and optimizer\n    criterion = nn.CrossEntropyLoss().cuda(args.gpu)\n\n    optimizer = torch.optim.SGD(model.parameters(), args.lr,\n                                momentum=args.momentum,\n                                weight_decay=args.weight_decay)\n\n    # optionally resume from a checkpoint\n    if args.resume:\n        if os.path.isfile(args.resume):\n            print(\"=> loading checkpoint '{}'\".format(args.resume))\n            if args.gpu is None:\n                checkpoint = torch.load(args.resume)\n            else:\n                # Map model to be loaded to specified single gpu.\n                loc = 'cuda:{}'.format(args.gpu)\n                checkpoint = torch.load(args.resume, map_location=loc)\n            args.start_epoch = checkpoint['epoch']\n            best_acc1 = checkpoint['best_acc1']\n            if args.gpu is not None:\n                # best_acc1 may be from a checkpoint from a different GPU\n                best_acc1 = best_acc1.to(args.gpu)\n            model.load_state_dict(checkpoint['state_dict'])\n            optimizer.load_state_dict(checkpoint['optimizer'])\n            print(\"=> loaded checkpoint '{}' (epoch {})\"\n                  .format(args.resume, checkpoint['epoch']))\n        else:\n            print(\"=> no checkpoint found at '{}'\".format(args.resume))\n\n    cudnn.benchmark = True\n\n    # Data loading code\n    traindir = os.path.join(args.data, 'train')\n    valdir = os.path.join(args.data, 'val')\n    normalize = transforms.Normalize(mean=[0.485, 0.456, 0.406],\n                                     std=[0.229, 0.224, 0.225])\n\n    train_dataset = datasets.ImageFolder(\n        traindir,\n        transforms.Compose([\n            transforms.RandomResizedCrop(224),\n            transforms.RandomHorizontalFlip(),\n            transforms.ToTensor(),\n            normalize,\n        ]))\n\n    if args.distributed:\n        train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset)\n    else:\n        train_sampler = None\n\n    train_loader = torch.utils.data.DataLoader(\n        train_dataset, batch_size=args.batch_size, shuffle=(train_sampler is None),\n        num_workers=args.workers, pin_memory=True, sampler=train_sampler)\n\n    val_loader = torch.utils.data.DataLoader(\n        datasets.ImageFolder(valdir, transforms.Compose([\n            transforms.Resize(256),\n            transforms.CenterCrop(224),\n            
transforms.ToTensor(),\n            normalize,\n        ])),\n        batch_size=args.batch_size, shuffle=False,\n        num_workers=args.workers, pin_memory=True)\n\n    if args.evaluate:\n        validate(val_loader, model, criterion, args)\n        return\n\n    for epoch in range(args.start_epoch, args.epochs):\n        if args.distributed:\n            train_sampler.set_epoch(epoch)\n        adjust_learning_rate(optimizer, epoch, args)\n\n        # train for one epoch\n        train(train_loader, model, criterion, optimizer, epoch, args)\n\n        # evaluate on validation set\n        acc1 = validate(val_loader, model, criterion, args)\n\n        # remember best acc@1 and save checkpoint\n        is_best = acc1 > best_acc1\n        best_acc1 = max(acc1, best_acc1)\n\n        if not args.multiprocessing_distributed or (args.multiprocessing_distributed\n                and args.rank % ngpus_per_node == 0):\n            # dump weight quantized model.\n            model.apply(quant_dequant_weight)\n            save_checkpoint({\n                'epoch': epoch + 1,\n                'arch': args.arch,\n                'state_dict': model.state_dict(),\n                'best_acc1': best_acc1,\n                'optimizer': optimizer.state_dict(),\n            }, is_best)\n            model.apply(unquant_weight)\n\n\ndef train(train_loader, model, criterion, optimizer, epoch, args):\n    batch_time = AverageMeter('Time', ':6.3f')\n    data_time = AverageMeter('Data', ':6.3f')\n    losses = AverageMeter('Loss', ':.4e')\n    top1 = AverageMeter('Acc@1', ':6.2f')\n    top5 = AverageMeter('Acc@5', ':6.2f')\n    progress = ProgressMeter(\n        len(train_loader),\n        [batch_time, data_time, losses, top1, top5],\n        prefix=\"Epoch: [{}]\".format(epoch))\n\n    # switch to train mode\n    model.train()\n    model = merge_freeze_bn(model)\n    end = time.time()\n\n    for i, (images, target) in enumerate(train_loader): \n        # measure data loading time\n        data_time.update(time.time() - end)\n\n        if args.gpu is not None:\n            images = images.cuda(args.gpu, non_blocking=True)\n        if torch.cuda.is_available():\n            target = target.cuda(args.gpu, non_blocking=True)\n\n        # compute output\n        output = model(images)\n        loss = criterion(output, target)\n\n        # measure accuracy and record loss\n        acc1, acc5 = accuracy(output, target, topk=(1, 5))\n        losses.update(loss.item(), images.size(0))\n        top1.update(acc1[0], images.size(0))\n        top5.update(acc5[0], images.size(0))\n\n        # compute gradient and do SGD step\n        optimizer.zero_grad()\n        loss.backward()\n        model.apply(unquant_weight)\n        optimizer.step()\n\n        # measure elapsed time\n        batch_time.update(time.time() - end)\n        end = time.time()\n\n        if i % args.print_freq == 0:\n            progress.display(i)\n\n\ndef validate(val_loader, model, criterion, args):\n    batch_time = AverageMeter('Time', ':6.3f')\n    losses = AverageMeter('Loss', ':.4e')\n    top1 = AverageMeter('Acc@1', ':6.2f')\n    top5 = AverageMeter('Acc@5', ':6.2f')\n    progress = ProgressMeter(\n        len(val_loader),\n        [batch_time, losses, top1, top5],\n        prefix='Test: ')\n\n    # switch to evaluate mode\n    model.eval()\n\n    with torch.no_grad():\n        end = time.time()\n        for i, (images, target) in enumerate(val_loader):\n            if args.gpu is not None:\n                images = images.cuda(args.gpu, 
non_blocking=True)\n            if torch.cuda.is_available():\n                target = target.cuda(args.gpu, non_blocking=True)\n\n            # compute output\n            output = model(images)\n            loss = criterion(output, target)\n\n            # measure accuracy and record loss\n            acc1, acc5 = accuracy(output, target, topk=(1, 5))\n            losses.update(loss.item(), images.size(0))\n            top1.update(acc1[0], images.size(0))\n            top5.update(acc5[0], images.size(0))\n\n            # measure elapsed time\n            batch_time.update(time.time() - end)\n            end = time.time()\n\n            if i % args.print_freq == 0:\n                progress.display(i)\n\n        # TODO: this should also be done with the ProgressMeter\n        print(' * Acc@1 {top1.avg:.3f} Acc@5 {top5.avg:.3f}'\n              .format(top1=top1, top5=top5))\n\n    return top1.avg\n\n\ndef save_checkpoint(state, is_best, filename='checkpoint.pth.tar'):\n    torch.save(state, filename)\n    if is_best:\n        shutil.copyfile(filename, 'model_best.pth.tar')\n\n\nclass AverageMeter(object):\n    \"\"\"Computes and stores the average and current value\"\"\"\n    def __init__(self, name, fmt=':f'):\n        self.name = name\n        self.fmt = fmt\n        self.reset()\n\n    def reset(self):\n        self.val = 0\n        self.avg = 0\n        self.sum = 0\n        self.count = 0\n\n    def update(self, val, n=1):\n        self.val = val\n        self.sum += val * n\n        self.count += n\n        self.avg = self.sum / self.count\n\n    def __str__(self):\n        fmtstr = '{name} {val' + self.fmt + '} ({avg' + self.fmt + '})'\n        return fmtstr.format(**self.__dict__)\n\n\nclass ProgressMeter(object):\n    def __init__(self, num_batches, meters, prefix=\"\"):\n        self.batch_fmtstr = self._get_batch_fmtstr(num_batches)\n        self.meters = meters\n        self.prefix = prefix\n\n    def display(self, batch):\n        entries = [self.prefix + self.batch_fmtstr.format(batch)]\n        entries += [str(meter) for meter in self.meters]\n        print('\\t'.join(entries))\n\n    def _get_batch_fmtstr(self, num_batches):\n        num_digits = len(str(num_batches))\n        fmt = '{:' + str(num_digits) + 'd}'\n        return '[' + fmt + '/' + fmt.format(num_batches) + ']'\n\n\ndef adjust_learning_rate(optimizer, epoch, args):\n    \"\"\"Decays the learning rate by a factor of 0.975 every 3 epochs\"\"\"\n    lr = args.lr * (0.975 ** (epoch // 3))\n    for param_group in optimizer.param_groups:\n        param_group['lr'] = lr\n\n\ndef accuracy(output, target, topk=(1,)):\n    \"\"\"Computes the accuracy over the k top predictions for the specified values of k\"\"\"\n    with torch.no_grad():\n        maxk = max(topk)\n        batch_size = target.size(0)\n\n        _, pred = output.topk(maxk, 1, True, True)\n        pred = pred.t()\n        correct = pred.eq(target.view(1, -1).expand_as(pred))\n\n        res = []\n        for k in topk:\n            # reshape, not view: correct[:k] is non-contiguous for k > 1\n            correct_k = correct[:k].reshape(-1).float().sum(0, keepdim=True)\n            res.append(correct_k.mul_(100.0 / batch_size))\n        return res\n\n\nif __name__ == '__main__':\n    main()\n"
  },
  {
    "path": "tests/test_merge_freeze_bn.py",
    "content": "# -*- coding:utf-8 -*-\nimport unittest\nfrom ddt import ddt, data\nimport torch\nfrom torch import nn\nfrom nnieqat import merge_freeze_bn, freeze_bn\n\n\n@ddt\nclass TestMergeFreezeBNImpl(unittest.TestCase):\n    def conv_bn(inp,\n                oup,\n                stride,\n                conv_layer=nn.Conv2d,\n                norm_layer=nn.BatchNorm2d):\n        return nn.Sequential(conv_layer(inp, oup, 3, stride, 1, bias=False),\n                             norm_layer(oup))\n\n    def conv_1x1_bn(inp, oup, conv_layer=nn.Conv2d, norm_layer=nn.BatchNorm2d):\n        return nn.Sequential(conv_layer(inp, oup, 1, 1, 0, bias=False),\n                             norm_layer(oup))\n\n    data1 = conv_bn(3, 3, 2)\n    data2 = conv_1x1_bn(3, 3)\n\n    @data(data1, data2)\n    def test(self, m):\n        input = torch.randn(1, 3, 10, 10)\n        m.eval()\n        output_0 = m(input)\n        print(\"module parameter before merge_freeze_bn: \")\n        print(list(m.named_parameters()))\n\n        m = merge_freeze_bn(m)\n        m.eval()\n        output_1 = m(input)\n        print(\"module parameter after merge_freeze_bn: \")\n        print(list(m.named_parameters()))\n\n        print(\"output result before merge_freeze_bn: \")\n        print(output_0)\n        print(\"output result after merge_freeze_bn: \")\n        print(output_1)\n        print(\"output result diff: \")\n        print(output_0 - output_1)\n\n\nif __name__ == \"__main__\":\n    suite = unittest.TestSuite()\n    suite.addTest(TestMergeFreezeBNImpl(\"test\"))\n    runner = unittest.TextTestRunner()\n    runner.run(suite)\n"
  },
  {
    "path": "tests/test_quant_impl.py",
    "content": "# -*- coding:utf-8 -*-\nimport unittest\nfrom ddt import ddt, data\nimport math\nimport ctypes\nimport datetime\nfrom ctypes import *\nimport numpy as np\nfrom numba import cuda\nimport numpy as np\nimport os\nos.environ[\"CUDA_VISIBLE_DEVICES\"] = \"0\"\n\n@ddt\nclass TestQuantImpl(unittest.TestCase):\n    max_thres = 512\n    data0 = np.array([0])\n    data1 = np.array([v / 25600 + 1.04\n                      for v in range(25600)] + [100, max_thres])\n    data2 = np.array([v / 25600 + 1.04\n                      for v in range(25600)] + [100, max_thres])\n    data2 = np.array([-v / 25600 - 1.04\n                      for v in range(25600)] + [-100, -max_thres])\n    data3 = np.array(\n        [0, 1, 2, 2.03992188, 2.03996094, 3, 4, 5, 10, 100, max_thres])\n    max_thres = 513\n    data4 = np.array([v / 25600 + 1.04\n                      for v in range(25600)] + [100, max_thres])\n    data5 = np.array([v / 25600 + 1.04\n                      for v in range(25600)] + [100, max_thres])\n    data6 = np.array([-v / 25600 - 1.04\n                      for v in range(25600)] + [-100, -max_thres])\n    data7 = np.array(\n        [0, 1, 2, 2.03992188, 2.03996094, 3, 4, 5, 10, 100, max_thres])\n    data8 = np.array([\n        0, -1, -2, -2.03992188, -2.03996094, -3, -4, -5, -10, -100, -max_thres\n    ])\n    data9 = np.array(range(1234))\n    data10 = np.array([-v for v in range(1234)])\n\n    @data(data0, data1, data2, data3, data4, data5, data6, data7, data8, data9,\n          data10)\n    def test(self, data):\n        os.environ['CUDA_VISIBLE_DEVICES'] = '0'\n        # load library\n        dl = ctypes.cdll.LoadLibrary\n        quant_lib = dl(\"nnieqat/gpu/lib/libgfpq_gpu.so\")\n        _libcublas = ctypes.cdll.LoadLibrary(\"libcublas.so\")\n\n        # struct GFPQ_PARAM_ST in gfpq.hpp\n        class GFPQ_PARAM_ST(ctypes.Structure):\n            _fields_ = [(\"mode\", ctypes.c_int), (\"buf\", ctypes.c_byte * 16)]\n\n        class _types:\n            \"\"\"Some alias types.\"\"\"\n            handle = ctypes.c_void_p\n            stream = ctypes.c_void_p\n\n        data_origin = data.copy()\n\n        print(\n            \"----------------------------------------------------------------------\"\n        )\n        print(\"\\n\\nOriginal data:\")\n        print(data)\n\n        data = data.astype(np.float32)\n        stream = cuda.stream()\n\n        _libcublas.cublasCreate_v2.restype = int\n        _libcublas.cublasCreate_v2.argtypes = [ctypes.c_void_p]\n        cublas_handle = _types.handle()\n        _libcublas.cublasCreate_v2(ctypes.byref(cublas_handle))\n\n        data_gpu = cuda.to_device(data, stream=stream)\n        data_p = data_gpu.device_ctypes_pointer\n        bit_width = 8\n\n        param = GFPQ_PARAM_ST()\n        # init or update param first\n        param.mode = 0\n        ret = quant_lib.HI_GFPQ_QuantAndDeQuant_GPU_PY(data_p, data.size,\n                                                       bit_width,\n                                                       ctypes.byref(param),\n                                                       stream.handle,\n                                                       cublas_handle)\n        if ret != 0:\n            print(\"HI_GFPQ_QuantAndDeQuant failed(%d)\\n\" % (ret)),\n\n        # use apply param\n        param.mode = 2\n        ret = quant_lib.HI_GFPQ_QuantAndDeQuant_GPU_PY(data_p, data.size,\n                                                       bit_width,\n                                                       
ctypes.byref(param),\n                                                       stream.handle,\n                                                       cublas_handle)\n        if ret != 0:\n            print(\"HI_GFPQ_QuantAndDeQuant failed(%d)\" % (ret)),\n\n        data_gpu.copy_to_host(data, stream=stream)\n        # data may not be available\n        stream.synchronize()\n        _libcublas.cublasDestroy_v2(cublas_handle)\n\n        import nnieqat\n        from quant_impl import fake_quantize\n        import torch\n        tensor = torch.Tensor(data_origin).cuda()\n        tensor.data = fake_quantize(tensor.data.detach(), 8)\n\n        diff = abs(tensor.cpu().numpy() - data)\n        # diff_thres = np.max(abs(data)) * 0.001\n        # print(\"\\nDIFF > 0.1%: \")\n        # print(\"idx: \", np.where(diff > diff_thres))\n        # print(\"Original data:\", data_origin[np.where(diff > diff_thres)])\n        # print(\"GFPQ result:\", data[np.where(diff > diff_thres)])\n        # print(\"Impl result:\", tensor.cpu().numpy()[np.where(diff > diff_thres)])\n        diff_max = np.max(diff)\n        print(\"\\nDIFF MAX: \" + str(diff_max))\n        print(\"\\nDIFF RATIO: \" +\n              str(diff_max / max(np.max(abs(data)), pow(10, -18))))\n\n\nif __name__ == \"__main__\":\n    suite = unittest.TestSuite()\n    suite.addTest(TestQuantImpl(\"test\"))\n    runner = unittest.TextTestRunner()\n    runner.run(suite)\n"
  }
]