[
  {
    "path": "DCNv2/.gitignore",
    "content": ".vscode\n.idea\n*.so\n*.o\n*pyc\n_ext\nbuild\nDCNv2.egg-info\ndist"
  },
  {
    "path": "DCNv2/LICENSE",
    "content": "BSD 3-Clause License\n\nCopyright (c) 2019, Charles Shang\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are permitted provided that the following conditions are met:\n\n1. Redistributions of source code must retain the above copyright notice, this\n   list of conditions and the following disclaimer.\n\n2. Redistributions in binary form must reproduce the above copyright notice,\n   this list of conditions and the following disclaimer in the documentation\n   and/or other materials provided with the distribution.\n\n3. Neither the name of the copyright holder nor the names of its\n   contributors may be used to endorse or promote products derived from\n   this software without specific prior written permission.\n\nTHIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\"\nAND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE\nIMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\nDISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE\nFOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL\nDAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR\nSERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER\nCAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,\nOR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE\nOF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE."
  },
  {
    "path": "DCNv2/README.md",
    "content": "## Deformable Convolutional Networks V2 with Pytorch 1.7\n\n### Build\n```bash\n    ./make.sh         # build\n    python test.py    # run examples and gradient check \n```\n\n### An Example\n- deformable conv\n```python\n    from dcn_v2 import DCN\n    input = torch.randn(2, 64, 128, 128).cuda()\n    # wrap all things (offset and mask) in DCN\n    dcn = DCN(64, 64, kernel_size=(3,3), stride=1, padding=1, deformable_groups=2).cuda()\n    output = dcn(input)\n    print(output.shape)\n```\n- deformable roi pooling\n```python\n    from dcn_v2 import DCNPooling\n    input = torch.randn(2, 32, 64, 64).cuda()\n    batch_inds = torch.randint(2, (20, 1)).cuda().float()\n    x = torch.randint(256, (20, 1)).cuda().float()\n    y = torch.randint(256, (20, 1)).cuda().float()\n    w = torch.randint(64, (20, 1)).cuda().float()\n    h = torch.randint(64, (20, 1)).cuda().float()\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n\n    # mdformable pooling (V2)\n    # wrap all things (offset and mask) in DCNPooling\n    dpooling = DCNPooling(spatial_scale=1.0 / 4,\n                         pooled_size=7,\n                         output_dim=32,\n                         no_trans=False,\n                         group_size=1,\n                         trans_std=0.1).cuda()\n\n    dout = dpooling(input, rois)\n```\n### Note\nNow the master branch is for pytorch 1.0 (new ATen API), you can switch back to pytorch 0.4 with,\n```bash\ngit checkout pytorch_0.4\n```\n\n### Known Issues:\n\n- [x] Gradient check w.r.t offset (solved)\n- [ ] Backward is not reentrant (minor)\n\nThis is an adaption of the official [Deformable-ConvNets](https://github.com/msracver/Deformable-ConvNets/tree/master/DCNv2_op).\n\n<s>I have ran the gradient check for many times with DOUBLE type. Every tensor **except offset** passes.\nHowever, when I set the offset to 0.5, it passes. I'm still wondering what cause this problem. Is it because some\nnon-differential points? 
</s>\n\nUpdate: all gradient check passes with double precision. \n\nAnother issue is that it raises `RuntimeError: Backward is not reentrant`. However, the error is very small (`<1e-7` for \nfloat `<1e-15` for double), \nso it may not be a serious problem (?)\n\nPlease post an issue or PR if you have any comments.\n    "
  },
  {
    "path": "DCNv2/__init__.py",
    "content": ""
  },
  {
    "path": "DCNv2/dcn_v2.py",
    "content": "#!/usr/bin/env python\nfrom __future__ import absolute_import, division, print_function\n\nimport math\n\nimport torch\nfrom torch import nn\nfrom torch.autograd import Function\nfrom torch.autograd.function import once_differentiable\nfrom torch.nn.modules.utils import _pair\nimport PIL\nfrom PIL import Image\nimport os\nimport numpy as np\n\n\nimport _ext as _backend\n\n\nclass _DCNv2(Function):\n    @staticmethod\n    def forward(ctx, input, offset, mask, weight, bias, stride, padding, dilation, deformable_groups):\n        ctx.stride = _pair(stride)\n        ctx.padding = _pair(padding)\n        ctx.dilation = _pair(dilation)\n        ctx.kernel_size = _pair(weight.shape[2:4])\n        ctx.deformable_groups = deformable_groups\n        output = _backend.dcn_v2_forward(\n            input,\n            weight,\n            bias,\n            offset,\n            mask,\n            ctx.kernel_size[0],\n            ctx.kernel_size[1],\n            ctx.stride[0],\n            ctx.stride[1],\n            ctx.padding[0],\n            ctx.padding[1],\n            ctx.dilation[0],\n            ctx.dilation[1],\n            ctx.deformable_groups,\n        )\n        ctx.save_for_backward(input, offset, mask, weight, bias)\n        return output\n\n    @staticmethod\n    @once_differentiable\n    def backward(ctx, grad_output):\n        input, offset, mask, weight, bias = ctx.saved_tensors\n        grad_input, grad_offset, grad_mask, grad_weight, grad_bias = _backend.dcn_v2_backward(\n            input,\n            weight,\n            bias,\n            offset,\n            mask,\n            grad_output,\n            ctx.kernel_size[0],\n            ctx.kernel_size[1],\n            ctx.stride[0],\n            ctx.stride[1],\n            ctx.padding[0],\n            ctx.padding[1],\n            ctx.dilation[0],\n            ctx.dilation[1],\n            ctx.deformable_groups,\n        )\n\n        return (\n            grad_input,\n            
grad_offset,\n            grad_mask,\n            grad_weight,\n            grad_bias,\n            None,\n            None,\n            None,\n            None,\n        )\n\n\ndcn_v2_conv = _DCNv2.apply\n\n\nclass DCNv2(nn.Module):\n    def __init__(\n        self,\n        in_channels,\n        out_channels,\n        kernel_size,\n        stride,\n        padding,\n        dilation=1,\n        deformable_groups=1,\n    ):\n        super(DCNv2, self).__init__()\n        self.in_channels = in_channels\n        self.out_channels = out_channels\n        self.kernel_size = _pair(kernel_size)\n        self.stride = _pair(stride)\n        self.padding = _pair(padding)\n        self.dilation = _pair(dilation)\n        self.deformable_groups = deformable_groups\n\n        self.weight = nn.Parameter(torch.Tensor(out_channels, in_channels, *self.kernel_size))\n        self.bias = nn.Parameter(torch.Tensor(out_channels))\n        self.reset_parameters()\n\n    def reset_parameters(self):\n        n = self.in_channels\n        for k in self.kernel_size:\n            n *= k\n        stdv = 1.0 / math.sqrt(n)\n        self.weight.data.uniform_(-stdv, stdv)\n        self.bias.data.zero_()\n\n    def forward(self, input, offset, mask):\n        assert 2 * self.deformable_groups * self.kernel_size[0] * self.kernel_size[1] == offset.shape[1]\n        assert self.deformable_groups * self.kernel_size[0] * self.kernel_size[1] == mask.shape[1]\n        return dcn_v2_conv(\n            input,\n            offset,\n            mask,\n            self.weight,\n            self.bias,\n            self.stride,\n            self.padding,\n            self.dilation,\n            self.deformable_groups,\n        )\n\n\nclass DCN(DCNv2):\n    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, dilation=1, deformable_groups=1, extra_offset_mask=False,):\n        super(DCN, self).__init__(in_channels, out_channels, kernel_size, stride, padding, dilation, 
deformable_groups)\n\n        self.extra_offset_mask = extra_offset_mask\n        channels_ = self.deformable_groups * 3 * self.kernel_size[0] * self.kernel_size[1]\n        self.conv_offset_mask = nn.Conv2d(self.in_channels, channels_, kernel_size=self.kernel_size, stride=self.stride, padding=self.padding, bias=True)\n        self.init_offset()\n\n    def init_offset(self):\n        self.conv_offset_mask.weight.data.zero_()\n        self.conv_offset_mask.bias.data.zero_()\n\n    def forward(self, input, main_path=None):\n        if self.extra_offset_mask:\n            # input is a pair: input[1] predicts the offset/mask, input[0] is convolved\n            out = self.conv_offset_mask(input[1])\n            input = input[0]\n        else:\n            out = self.conv_offset_mask(input)\n        # each chunk has self.deformable_groups * kernel_h * kernel_w channels\n        o1, o2, mask = torch.chunk(out, 3, dim=1)\n        offset = torch.cat((o1, o2), dim=1)\n        mask = torch.sigmoid(mask)\n        return dcn_v2_conv(input, offset, mask, self.weight, self.bias, self.stride, self.padding, self.dilation, self.deformable_groups)\n\n\nclass _DCNv2Pooling(Function):\n    @staticmethod\n    def forward(\n        ctx,\n        input,\n        rois,\n        offset,\n        spatial_scale,\n        pooled_size,\n        output_dim,\n        no_trans,\n        group_size=1,\n        part_size=None,\n        sample_per_part=4,\n        trans_std=0.0,\n    ):\n        ctx.spatial_scale = spatial_scale\n        ctx.no_trans = int(no_trans)\n        ctx.output_dim = output_dim\n        ctx.group_size = group_size\n        ctx.pooled_size = pooled_size\n        ctx.part_size = pooled_size if part_size is None else part_size\n        ctx.sample_per_part = sample_per_part\n        ctx.trans_std = trans_std\n\n        output, output_count = _backend.dcn_v2_psroi_pooling_forward(\n            input,\n            rois,\n            offset,\n            ctx.no_trans,\n            ctx.spatial_scale,\n            ctx.output_dim,\n            ctx.group_size,\n            ctx.pooled_size,\n            ctx.part_size,\n            ctx.sample_per_part,\n            ctx.trans_std,\n        )\n        ctx.save_for_backward(input, rois, offset, output_count)\n        return output\n\n    @staticmethod\n    @once_differentiable\n    def backward(ctx, grad_output):\n        input, rois, offset, output_count = ctx.saved_tensors\n        grad_input, grad_offset = _backend.dcn_v2_psroi_pooling_backward(\n            grad_output,\n            input,\n            rois,\n            offset,\n            output_count,\n            ctx.no_trans,\n            ctx.spatial_scale,\n            ctx.output_dim,\n            ctx.group_size,\n            ctx.pooled_size,\n            ctx.part_size,\n
ctx.sample_per_part,\n            ctx.trans_std,\n        )\n\n        return grad_input, None, grad_offset, None, None, None, None, None, None, None, None\n\n\ndcn_v2_pooling = _DCNv2Pooling.apply\n\n\nclass DCNv2Pooling(nn.Module):\n    def __init__(\n        self,\n        spatial_scale,\n        pooled_size,\n        output_dim,\n        no_trans,\n        group_size=1,\n        part_size=None,\n        sample_per_part=4,\n        trans_std=0.0,\n    ):\n        super(DCNv2Pooling, self).__init__()\n        self.spatial_scale = spatial_scale\n        self.pooled_size = pooled_size\n        self.output_dim = output_dim\n        self.no_trans = no_trans\n        self.group_size = group_size\n        self.part_size = pooled_size if part_size is None else part_size\n        self.sample_per_part = sample_per_part\n        self.trans_std = trans_std\n\n    def forward(self, input, rois, offset):\n        assert input.shape[1] == self.output_dim\n        if self.no_trans:\n            offset = input.new()\n        return dcn_v2_pooling(\n            input,\n            rois,\n            offset,\n            self.spatial_scale,\n            self.pooled_size,\n            self.output_dim,\n            self.no_trans,\n            self.group_size,\n            self.part_size,\n            self.sample_per_part,\n            self.trans_std,\n        )\n\n\nclass DCNPooling(DCNv2Pooling):\n    def __init__(\n        self,\n        spatial_scale,\n        pooled_size,\n        output_dim,\n        no_trans,\n        group_size=1,\n        part_size=None,\n        sample_per_part=4,\n        trans_std=0.0,\n        deform_fc_dim=1024,\n    ):\n        super(DCNPooling, self).__init__(\n            spatial_scale,\n            pooled_size,\n            output_dim,\n            no_trans,\n            group_size,\n            part_size,\n            sample_per_part,\n            trans_std,\n        )\n\n        self.deform_fc_dim = deform_fc_dim\n\n        if not no_trans:\n      
      self.offset_mask_fc = nn.Sequential(\n                nn.Linear(self.pooled_size * self.pooled_size * self.output_dim, self.deform_fc_dim),\n                nn.ReLU(inplace=True),\n                nn.Linear(self.deform_fc_dim, self.deform_fc_dim),\n                nn.ReLU(inplace=True),\n                nn.Linear(self.deform_fc_dim, self.pooled_size * self.pooled_size * 3),\n            )\n            self.offset_mask_fc[4].weight.data.zero_()\n            self.offset_mask_fc[4].bias.data.zero_()\n\n    def forward(self, input, rois):\n        offset = input.new()\n\n        if not self.no_trans:\n\n            # do roi_align first\n            n = rois.shape[0]\n            roi = dcn_v2_pooling(\n                input,\n                rois,\n                offset,\n                self.spatial_scale,\n                self.pooled_size,\n                self.output_dim,\n                True,  # no trans\n                self.group_size,\n                self.part_size,\n                self.sample_per_part,\n                self.trans_std,\n            )\n\n            # build mask and offset\n            offset_mask = self.offset_mask_fc(roi.view(n, -1))\n            offset_mask = offset_mask.view(n, 3, self.pooled_size, self.pooled_size)\n            o1, o2, mask = torch.chunk(offset_mask, 3, dim=1)\n            offset = torch.cat((o1, o2), dim=1)\n            mask = torch.sigmoid(mask)\n\n            # do pooling with offset and mask\n            return (\n                dcn_v2_pooling(\n                    input,\n                    rois,\n                    offset,\n                    self.spatial_scale,\n                    self.pooled_size,\n                    self.output_dim,\n                    self.no_trans,\n                    self.group_size,\n                    self.part_size,\n                    self.sample_per_part,\n                    self.trans_std,\n                )\n                * mask\n            )\n        # only 
roi_align\n        return dcn_v2_pooling(\n            input,\n            rois,\n            offset,\n            self.spatial_scale,\n            self.pooled_size,\n            self.output_dim,\n            self.no_trans,\n            self.group_size,\n            self.part_size,\n            self.sample_per_part,\n            self.trans_std,\n        )\n"
  },
  {
    "path": "DCNv2/make.sh",
    "content": "#!/usr/bin/env bash\npython setup.py build develop\n"
  },
  {
    "path": "DCNv2/setup.py",
    "content": "#!/usr/bin/env python\n\nimport glob\nimport os\n\nimport torch\nfrom setuptools import find_packages, setup\nfrom torch.utils.cpp_extension import CUDA_HOME, CppExtension, CUDAExtension\n\nrequirements = [\"torch\", \"torchvision\"]\n\n\ndef get_extensions():\n    this_dir = os.path.dirname(os.path.abspath(__file__))\n    extensions_dir = os.path.join(this_dir, \"src\")\n\n    main_file = glob.glob(os.path.join(extensions_dir, \"*.cpp\"))\n    source_cpu = glob.glob(os.path.join(extensions_dir, \"cpu\", \"*.cpp\"))\n    source_cuda = glob.glob(os.path.join(extensions_dir, \"cuda\", \"*.cu\"))\n\n    os.environ[\"CC\"] = \"g++\"\n    sources = main_file + source_cpu\n    extension = CppExtension\n    extra_compile_args = {\"cxx\": []}\n    define_macros = []\n\n    if torch.cuda.is_available() and CUDA_HOME is not None:\n        extension = CUDAExtension\n        sources += source_cuda\n        define_macros += [(\"WITH_CUDA\", None)]\n        extra_compile_args[\"nvcc\"] = [\n            \"-DCUDA_HAS_FP16=1\",\n            \"-D__CUDA_NO_HALF_OPERATORS__\",\n            \"-D__CUDA_NO_HALF_CONVERSIONS__\",\n            \"-D__CUDA_NO_HALF2_OPERATORS__\",\n        ]\n    else:\n        # raise NotImplementedError('Cuda is not available')\n        pass\n\n    sources = [os.path.join(extensions_dir, s) for s in sources]\n    include_dirs = [extensions_dir]\n    ext_modules = [\n        extension(\n            \"_ext\",\n            sources,\n            include_dirs=include_dirs,\n            define_macros=define_macros,\n            extra_compile_args=extra_compile_args,\n        )\n    ]\n    return ext_modules\n\n\nsetup(\n    name=\"DCNv2\",\n    version=\"0.1\",\n    author=\"charlesshang\",\n    url=\"https://github.com/charlesshang/DCNv2\",\n    description=\"deformable convolutional networks\",\n    packages=find_packages(\n        exclude=(\n            \"configs\",\n            \"tests\",\n        )\n    ),\n    # 
install_requires=requirements,\n    ext_modules=get_extensions(),\n    cmdclass={\"build_ext\": torch.utils.cpp_extension.BuildExtension},\n)\n"
  },
  {
    "path": "DCNv2/src/cpu/dcn_v2_cpu.cpp",
    "content": "#include <vector>\n#include \"cpu/dcn_v2_im2col_cpu.h\"\n#include <iostream>\n\n#include <ATen/ATen.h>\n//#include <ATen/cuda/CUDAContext.h>\n\n#include <TH/TH.h>\n//#include <THC/THCAtomics.cuh>\n//#include <THC/THCDeviceUtils.cuh>\n\n//extern THCState *state;\n\n// author: Charles Shang\n// https://github.com/torch/cunn/blob/master/lib/THCUNN/generic/SpatialConvolutionMM.cu\n\n// modified from the CUDA version for CPU use by Daniel K. Suhendro\n\n// edit by: James Bockman and Matthew Howe\n// modified for torch implementation to remove use of deprecated torch access to Blas\n\nat::Tensor\ndcn_v2_cpu_forward(const at::Tensor &input,\n                    const at::Tensor &weight,\n                    const at::Tensor &bias,\n                    const at::Tensor &offset,\n                    const at::Tensor &mask,\n                    const int kernel_h,\n                    const int kernel_w,\n                    const int stride_h,\n                    const int stride_w,\n                    const int pad_h,\n                    const int pad_w,\n                    const int dilation_h,\n                    const int dilation_w,\n                    const int deformable_group)\n{\n    // THCAssertSameGPU(THCudaTensor_checkGPU(state, 5, input, weight, bias, offset, mask));\n    /*AT_ASSERTM(input.type().is_cuda(), \"input must be a CUDA tensor\");\n    AT_ASSERTM(weight.type().is_cuda(), \"weight must be a CUDA tensor\");\n    AT_ASSERTM(bias.type().is_cuda(), \"bias must be a CUDA tensor\");\n    AT_ASSERTM(offset.type().is_cuda(), \"offset must be a CUDA tensor\");\n    AT_ASSERTM(mask.type().is_cuda(), \"mask must be a CUDA tensor\");*/\n\n    const int batch = input.size(0);\n    const int channels = input.size(1);\n    const int height = input.size(2);\n    const int width = input.size(3);\n\n    const int channels_out = weight.size(0);\n    const int channels_kernel = weight.size(1);\n    const int kernel_h_ = weight.size(2);\n    const 
int kernel_w_ = weight.size(3);\n\n    AT_ASSERTM(kernel_h_ == kernel_h && kernel_w_ == kernel_w,\n               \"Input shape and kernel shape won't match: (%d x %d vs %d x %d).\", kernel_h, kernel_w, kernel_h_, kernel_w_);\n\n    AT_ASSERTM(channels == channels_kernel,\n               \"Input shape and kernel channels won't match: (%d vs %d).\", channels, channels_kernel);\n\n    const int height_out = (height + 2 * pad_h - (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;\n    const int width_out = (width + 2 * pad_w - (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;\n\n    auto ones = at::ones({bias.sizes()[0], height_out, width_out}, input.options());\n    auto columns = at::empty({channels * kernel_h * kernel_w, 1 * height_out * width_out}, input.options());\n    auto output = at::zeros({batch, channels_out, height_out, width_out}, input.options());\n\n    using scalar_t = float;\n    for (int b = 0; b < batch; b++)\n    {\n        auto input_n = input.select(0, b);\n        auto offset_n = offset.select(0, b);\n        auto mask_n = mask.select(0, b);\n        auto output_n = output.select(0, b);\n\n        // Add the bias first: broadcast bias over the (height_out, width_out) plane\n        // via a (channels_out, height_out, width_out) tensor of ones.\n        auto ones_T = at::transpose(ones.contiguous(), 2, 0);\n        ones_T = at::mul(ones_T, bias.contiguous());\n        ones_T = at::transpose(ones_T, 2, 0);\n        output_n = at::add(output_n, ones_T);\n\n        
modulated_deformable_im2col_cpu(input_n.data_ptr<scalar_t>(),\n                                         offset_n.data_ptr<scalar_t>(),\n                                         mask_n.data_ptr<scalar_t>(),\n                                         1, channels, height, width,\n                                         height_out, width_out, kernel_h, kernel_w,\n                                         pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w,\n                                         deformable_group,\n                                         columns.data_ptr<scalar_t>());\n\n        //(k * m)  x  (m * n)\n        // Y = WC\n\n        // torch implementation\n        auto weight_flat = weight.view({channels_out, channels * kernel_h * kernel_w});\n        auto product = at::matmul(weight_flat, columns);\n        output.select(0, b) = at::add(output_n, product.view({channels_out, height_out, width_out}));\n    }\n    return output;\n}\n\nstd::vector<at::Tensor> dcn_v2_cpu_backward(const at::Tensor &input,\n                                             const at::Tensor &weight,\n                                             const at::Tensor &bias,\n                                             const at::Tensor &offset,\n                                             const at::Tensor &mask,\n                                             const at::Tensor &grad_output,\n                                             int kernel_h, int kernel_w,\n                                             int stride_h, int stride_w,\n                                             int pad_h, int pad_w,\n                                             int dilation_h, int dilation_w,\n                                             int deformable_group)\n{\n\n    THArgCheck(input.is_contiguous(), 1, \"input tensor has to be contiguous\");\n    THArgCheck(weight.is_contiguous(), 2, \"weight tensor has to be contiguous\");\n\n    /*AT_ASSERTM(input.type().is_cuda(), \"input must be a CUDA 
tensor\");\n    AT_ASSERTM(weight.type().is_cuda(), \"weight must be a CUDA tensor\");\n    AT_ASSERTM(bias.type().is_cuda(), \"bias must be a CUDA tensor\");\n    AT_ASSERTM(offset.type().is_cuda(), \"offset must be a CUDA tensor\");\n    AT_ASSERTM(mask.type().is_cuda(), \"mask must be a CUDA tensor\");*/\n\n    const int batch = input.size(0);\n    const int channels = input.size(1);\n    const int height = input.size(2);\n    const int width = input.size(3);\n\n    const int channels_out = weight.size(0);\n    const int channels_kernel = weight.size(1);\n    const int kernel_h_ = weight.size(2);\n    const int kernel_w_ = weight.size(3);\n\n    AT_ASSERTM(kernel_h_ == kernel_h && kernel_w_ == kernel_w,\n               \"Input shape and kernel shape wont match: (%d x %d vs %d x %d).\", kernel_h_, kernel_w, kernel_h_, kernel_w_);\n\n    AT_ASSERTM(channels == channels_kernel,\n               \"Input shape and kernel channels wont match: (%d vs %d).\", channels, channels_kernel);\n\n    const int height_out = (height + 2 * pad_h - (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;\n    const int width_out = (width + 2 * pad_w - (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;\n\n    auto ones = at::ones({height_out, width_out}, input.options());\n    auto columns = at::zeros({channels * kernel_h * kernel_w, 1 * height_out * width_out}, input.options());\n    auto output = at::empty({batch, channels_out, height_out, width_out}, input.options());\n\n    auto grad_input = at::zeros_like(input);\n    auto grad_weight = at::zeros_like(weight);\n    auto grad_bias = at::zeros_like(bias);\n    auto grad_offset = at::zeros_like(offset);\n    auto grad_mask = at::zeros_like(mask);\n\n    using scalar_t = float;\n\n    for (int b = 0; b < batch; b++)\n    {\n        auto input_n = input.select(0, b);\n        auto offset_n = offset.select(0, b);\n        auto mask_n = mask.select(0, b);\n        auto grad_output_n = grad_output.select(0, b);\n        auto grad_input_n = 
grad_input.select(0, b);\n        auto grad_offset_n = grad_offset.select(0, b);\n        auto grad_mask_n = grad_mask.select(0, b);\n\n\n\n        // Torch implementation\n        auto weight_flat = weight.view({channels_out, channels*kernel_h*kernel_w});\n        weight_flat = at::transpose(weight_flat, 1, 0);\n        auto grad_output_n_flat = grad_output_n.view({channels_out, height_out*width_out});\n        columns = at::matmul(weight_flat, grad_output_n_flat);\n\n        // gradient w.r.t. input coordinate data\n        modulated_deformable_col2im_coord_cpu(columns.data_ptr<scalar_t>(),\n                                               input_n.data_ptr<scalar_t>(),\n                                               offset_n.data_ptr<scalar_t>(),\n                                               mask_n.data_ptr<scalar_t>(),\n                                               1, channels, height, width,\n                                               height_out, width_out, kernel_h, kernel_w,\n                                               pad_h, pad_w, stride_h, stride_w,\n                                               dilation_h, dilation_w, deformable_group,\n                                               grad_offset_n.data_ptr<scalar_t>(),\n                                               grad_mask_n.data_ptr<scalar_t>());\n        // gradient w.r.t. 
input data\n        modulated_deformable_col2im_cpu(columns.data_ptr<scalar_t>(),\n                                         offset_n.data_ptr<scalar_t>(),\n                                         mask_n.data_ptr<scalar_t>(),\n                                         1, channels, height, width,\n                                         height_out, width_out, kernel_h, kernel_w,\n                                         pad_h, pad_w, stride_h, stride_w,\n                                         dilation_h, dilation_w, deformable_group,\n                                         grad_input_n.data_ptr<scalar_t>());\n\n        // gradient w.r.t. weight, dWeight should accumulate across the batch and group\n        modulated_deformable_im2col_cpu(input_n.data_ptr<scalar_t>(),\n                                         offset_n.data_ptr<scalar_t>(),\n                                         mask_n.data_ptr<scalar_t>(),\n                                         1, channels, height, width,\n                                         height_out, width_out, kernel_h, kernel_w,\n                                         pad_h, pad_w, stride_h, stride_w,\n                                         dilation_h, dilation_w, deformable_group,\n                                         columns.data_ptr<scalar_t>());\n\n        // Torch implementation\n        auto product = at::matmul(grad_output_n_flat, at::transpose(columns, 1, 0));\n        grad_weight = at::add(grad_weight, product.view({channels_out, channels, kernel_h, kernel_w}));\n\n\n        // Torch implementation\n        auto ones_flat = ones.view({height_out*width_out});\n        product = at::matmul(grad_output_n_flat, ones_flat);\n        grad_bias = at::add(grad_bias, product);\n    }\n\n    return {\n        grad_input, grad_offset, grad_mask, grad_weight, grad_bias\n    };\n}\n"
  },
  {
    "path": "DCNv2/src/cpu/dcn_v2_im2col_cpu.cpp",
    "content": "#include \"dcn_v2_im2col_cpu.h\"\n#include <cstdio>\n#include <algorithm>\n#include <cstring>\n\n#include <ATen/ATen.h>\n//#include <ATen/cuda/CUDAContext.h>\n\n#include <TH/TH.h>\n//#include <THC/THCAtomics.cuh>\n//#include <THC/THCDeviceUtils.cuh>\n\n// modified from the CUDA version for CPU use by Daniel K. Suhendro\n\n/*#define CUDA_KERNEL_LOOP(i, n)                          \\\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x;   \\\n      i < (n);                                          \\\n      i += blockDim.x * gridDim.x)\n\nconst int CUDA_NUM_THREADS = 1024;\ninline int GET_BLOCKS(const int N)\n{\n  return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS;\n}*/\n\n\nfloat dmcn_im2col_bilinear_cpu(const float *bottom_data, const int data_width,\n                           const int height, const int width, float h, float w)\n{\n  int h_low = floor(h);\n  int w_low = floor(w);\n  int h_high = h_low + 1;\n  int w_high = w_low + 1;\n\n  float lh = h - h_low;\n  float lw = w - w_low;\n  float hh = 1 - lh, hw = 1 - lw;\n\n  float v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n    v1 = bottom_data[h_low * data_width + w_low];\n  float v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n    v2 = bottom_data[h_low * data_width + w_high];\n  float v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n    v3 = bottom_data[h_high * data_width + w_low];\n  float v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n    v4 = bottom_data[h_high * data_width + w_high];\n\n  float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n\n  float val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  return val;\n}\n\nfloat dmcn_get_gradient_weight_cpu(float argmax_h, float argmax_w,\n                               const int h, const int w, const int height, const int width)\n{\n  if (argmax_h <= -1 || argmax_h >= height || argmax_w <= -1 || argmax_w >= width)\n  {\n    //empty\n    return 0;\n  }\n\n  int argmax_h_low = floor(argmax_h);\n  int argmax_w_low = 
floor(argmax_w);\n  int argmax_h_high = argmax_h_low + 1;\n  int argmax_w_high = argmax_w_low + 1;\n\n  float weight = 0;\n  if (h == argmax_h_low && w == argmax_w_low)\n    weight = (h + 1 - argmax_h) * (w + 1 - argmax_w);\n  if (h == argmax_h_low && w == argmax_w_high)\n    weight = (h + 1 - argmax_h) * (argmax_w + 1 - w);\n  if (h == argmax_h_high && w == argmax_w_low)\n    weight = (argmax_h + 1 - h) * (w + 1 - argmax_w);\n  if (h == argmax_h_high && w == argmax_w_high)\n    weight = (argmax_h + 1 - h) * (argmax_w + 1 - w);\n  return weight;\n}\n\nfloat dmcn_get_coordinate_weight_cpu(float argmax_h, float argmax_w,\n                                 const int height, const int width, const float *im_data,\n                                 const int data_width, const int bp_dir)\n{\n  if (argmax_h <= -1 || argmax_h >= height || argmax_w <= -1 || argmax_w >= width)\n  {\n    //empty\n    return 0;\n  }\n\n  int argmax_h_low = floor(argmax_h);\n  int argmax_w_low = floor(argmax_w);\n  int argmax_h_high = argmax_h_low + 1;\n  int argmax_w_high = argmax_w_low + 1;\n\n  float weight = 0;\n\n  if (bp_dir == 0)\n  {\n    if (argmax_h_low >= 0 && argmax_w_low >= 0)\n      weight += -1 * (argmax_w_low + 1 - argmax_w) * im_data[argmax_h_low * data_width + argmax_w_low];\n    if (argmax_h_low >= 0 && argmax_w_high <= width - 1)\n      weight += -1 * (argmax_w - argmax_w_low) * im_data[argmax_h_low * data_width + argmax_w_high];\n    if (argmax_h_high <= height - 1 && argmax_w_low >= 0)\n      weight += (argmax_w_low + 1 - argmax_w) * im_data[argmax_h_high * data_width + argmax_w_low];\n    if (argmax_h_high <= height - 1 && argmax_w_high <= width - 1)\n      weight += (argmax_w - argmax_w_low) * im_data[argmax_h_high * data_width + argmax_w_high];\n  }\n  else if (bp_dir == 1)\n  {\n    if (argmax_h_low >= 0 && argmax_w_low >= 0)\n      weight += -1 * (argmax_h_low + 1 - argmax_h) * im_data[argmax_h_low * data_width + argmax_w_low];\n    if (argmax_h_low >= 0 && 
argmax_w_high <= width - 1)\n      weight += (argmax_h_low + 1 - argmax_h) * im_data[argmax_h_low * data_width + argmax_w_high];\n    if (argmax_h_high <= height - 1 && argmax_w_low >= 0)\n      weight += -1 * (argmax_h - argmax_h_low) * im_data[argmax_h_high * data_width + argmax_w_low];\n    if (argmax_h_high <= height - 1 && argmax_w_high <= width - 1)\n      weight += (argmax_h - argmax_h_low) * im_data[argmax_h_high * data_width + argmax_w_high];\n  }\n\n  return weight;\n}\n\nvoid modulated_deformable_im2col_cpu_kernel(const int n, const float *data_im, const float *data_offset, const float *data_mask,\n                                                       const int height, const int width, const int kernel_h, const int kernel_w,\n                                                       const int pad_h, const int pad_w,\n                                                       const int stride_h, const int stride_w,\n                                                       const int dilation_h, const int dilation_w,\n                                                       const int channel_per_deformable_group,\n                                                       const int batch_size, const int num_channels, const int deformable_group,\n                                                       const int height_col, const int width_col,\n                                                       float *data_col)\n{\n  // launch channels * batch_size * height_col * width_col cores\n  for(int index=0; index<n; index++)\n  {\n    // NOTE(CharlesShang): different from Dai Jifeng's MXNet implementation, col_buffer is of shape (c*kw*kh, N, oh, ow)\n    // here columns is of shape (N, c*kw*kh, oh * ow), need to adapt axis\n\n    // index index of output matrix\n    const int w_col = index % width_col;\n    const int h_col = (index / width_col) % height_col;\n    // const int b_col = (index / width_col / height_col) % batch_size;\n    const int b_col = (index / width_col / 
height_col / num_channels) % batch_size;\n    // const int c_im = (index / width_col / height_col) / batch_size;\n    const int c_im = (index / width_col / height_col) % num_channels;\n    // const int c_col = c_im * kernel_h * kernel_w;\n    const int c_col = c_im * kernel_h * kernel_w;\n\n    // compute deformable group index\n    const int deformable_group_index = c_im / channel_per_deformable_group;\n\n    const int h_in = h_col * stride_h - pad_h;\n    const int w_in = w_col * stride_w - pad_w;\n\n    //  float *data_col_ptr = data_col + ((c_col * batch_size + b_col) * height_col + h_col) * width_col + w_col;\n    float *data_col_ptr = data_col + ((b_col * num_channels * kernel_w * kernel_h + c_col) * height_col + h_col) * width_col + w_col;\n    //const float* data_im_ptr = data_im + ((b_col * num_channels + c_im) * height + h_in) * width + w_in;\n    const float *data_im_ptr = data_im + (b_col * num_channels + c_im) * height * width;\n    const float *data_offset_ptr = data_offset + (b_col * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n\n    const float *data_mask_ptr = data_mask + (b_col * deformable_group + deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n\n    for (int i = 0; i < kernel_h; ++i)\n    {\n      for (int j = 0; j < kernel_w; ++j)\n      {\n        const int data_offset_h_ptr = ((2 * (i * kernel_w + j)) * height_col + h_col) * width_col + w_col;\n        const int data_offset_w_ptr = ((2 * (i * kernel_w + j) + 1) * height_col + h_col) * width_col + w_col;\n        const int data_mask_hw_ptr = ((i * kernel_w + j) * height_col + h_col) * width_col + w_col;\n        const float offset_h = data_offset_ptr[data_offset_h_ptr];\n        const float offset_w = data_offset_ptr[data_offset_w_ptr];\n        const float mask = data_mask_ptr[data_mask_hw_ptr];\n        float val = static_cast<float>(0);\n        const float h_im = h_in + i * dilation_h + offset_h;\n        const 
float w_im = w_in + j * dilation_w + offset_w;\n        //if (h_im >= 0 && w_im >= 0 && h_im < height && w_im < width) {\n        if (h_im > -1 && w_im > -1 && h_im < height && w_im < width)\n        {\n          //const float map_h = i * dilation_h + offset_h;\n          //const float map_w = j * dilation_w + offset_w;\n          //const int cur_height = height - h_in;\n          //const int cur_width = width - w_in;\n          //val = dmcn_im2col_bilinear_cpu(data_im_ptr, width, cur_height, cur_width, map_h, map_w);\n          val = dmcn_im2col_bilinear_cpu(data_im_ptr, width, height, width, h_im, w_im);\n        }\n        *data_col_ptr = val * mask;\n        // data_col_ptr += batch_size * height_col * width_col;\n        data_col_ptr += height_col * width_col;\n      }\n    }\n  }\n}\n\nvoid modulated_deformable_col2im_cpu_kernel(const int n, const float *data_col, const float *data_offset, const float *data_mask,\n                                                       const int channels, const int height, const int width,\n                                                       const int kernel_h, const int kernel_w,\n                                                       const int pad_h, const int pad_w,\n                                                       const int stride_h, const int stride_w,\n                                                       const int dilation_h, const int dilation_w,\n                                                       const int channel_per_deformable_group,\n                                                       const int batch_size, const int deformable_group,\n                                                       const int height_col, const int width_col,\n                                                       float *grad_im)\n{\n  for(int index = 0; index < n; index++)\n  {\n    const int j = (index / width_col / height_col / batch_size) % kernel_w;\n    const int i = (index / width_col / height_col / batch_size / 
kernel_w) % kernel_h;\n    const int c = index / width_col / height_col / batch_size / kernel_w / kernel_h;\n    // compute the start and end of the output\n\n    const int deformable_group_index = c / channel_per_deformable_group;\n\n    int w_out = index % width_col;\n    int h_out = (index / width_col) % height_col;\n    int b = (index / width_col / height_col) % batch_size;\n    int w_in = w_out * stride_w - pad_w;\n    int h_in = h_out * stride_h - pad_h;\n\n    const float *data_offset_ptr = data_offset + (b * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n    const float *data_mask_ptr = data_mask + (b * deformable_group + deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n    const int data_offset_h_ptr = ((2 * (i * kernel_w + j)) * height_col + h_out) * width_col + w_out;\n    const int data_offset_w_ptr = ((2 * (i * kernel_w + j) + 1) * height_col + h_out) * width_col + w_out;\n    const int data_mask_hw_ptr = ((i * kernel_w + j) * height_col + h_out) * width_col + w_out;\n    const float offset_h = data_offset_ptr[data_offset_h_ptr];\n    const float offset_w = data_offset_ptr[data_offset_w_ptr];\n    const float mask = data_mask_ptr[data_mask_hw_ptr];\n    const float cur_inv_h_data = h_in + i * dilation_h + offset_h;\n    const float cur_inv_w_data = w_in + j * dilation_w + offset_w;\n\n    const float cur_top_grad = data_col[index] * mask;\n    const int cur_h = (int)cur_inv_h_data;\n    const int cur_w = (int)cur_inv_w_data;\n    \n    for (int dy = -2; dy <= 2; dy++)\n    {\n      for (int dx = -2; dx <= 2; dx++)\n      {\n        // std::abs keeps the float overload; unqualified abs may resolve to the int version\n        if (cur_h + dy >= 0 && cur_h + dy < height &&\n            cur_w + dx >= 0 && cur_w + dx < width &&\n            std::abs(cur_inv_h_data - (cur_h + dy)) < 1 &&\n            std::abs(cur_inv_w_data - (cur_w + dx)) < 1)\n        {\n          int cur_bottom_grad_pos = ((b * channels + c) * height + cur_h + dy) * width + cur_w + dx;\n          float weight = 
dmcn_get_gradient_weight_cpu(cur_inv_h_data, cur_inv_w_data, cur_h + dy, cur_w + dx, height, width);\n          //atomicAdd(grad_im + cur_bottom_grad_pos, weight * cur_top_grad);\n          *(grad_im + cur_bottom_grad_pos) += weight * cur_top_grad;\n\n        }\n      }\n    }\n  }\n}\n\nvoid modulated_deformable_col2im_coord_cpu_kernel(const int n, const float *data_col, const float *data_im,\n                                                             const float *data_offset, const float *data_mask,\n                                                             const int channels, const int height, const int width,\n                                                             const int kernel_h, const int kernel_w,\n                                                             const int pad_h, const int pad_w,\n                                                             const int stride_h, const int stride_w,\n                                                             const int dilation_h, const int dilation_w,\n                                                             const int channel_per_deformable_group,\n                                                             const int batch_size, const int offset_channels, const int deformable_group,\n                                                             const int height_col, const int width_col,\n                                                             float *grad_offset, float *grad_mask)\n{\n  for(int index = 0; index < n; index++)\n  {\n    float val = 0, mval = 0;\n    int w = index % width_col;\n    int h = (index / width_col) % height_col;\n    int c = (index / width_col / height_col) % offset_channels;\n    int b = (index / width_col / height_col) / offset_channels;\n    // compute the start and end of the output\n\n    const int deformable_group_index = c / (2 * kernel_h * kernel_w);\n    const int col_step = kernel_h * kernel_w;\n    int cnt = 0;\n    const float *data_col_ptr = data_col + 
deformable_group_index * channel_per_deformable_group * batch_size * width_col * height_col;\n    const float *data_im_ptr = data_im + (b * deformable_group + deformable_group_index) * channel_per_deformable_group / kernel_h / kernel_w * height * width;\n    const float *data_offset_ptr = data_offset + (b * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n    const float *data_mask_ptr = data_mask + (b * deformable_group + deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n\n    const int offset_c = c - deformable_group_index * 2 * kernel_h * kernel_w;\n\n    for (int col_c = (offset_c / 2); col_c < channel_per_deformable_group; col_c += col_step)\n    {\n      const int col_pos = (((col_c * batch_size + b) * height_col) + h) * width_col + w;\n      const int bp_dir = offset_c % 2;\n\n      int j = (col_pos / width_col / height_col / batch_size) % kernel_w;\n      int i = (col_pos / width_col / height_col / batch_size / kernel_w) % kernel_h;\n      int w_out = col_pos % width_col;\n      int h_out = (col_pos / width_col) % height_col;\n      int w_in = w_out * stride_w - pad_w;\n      int h_in = h_out * stride_h - pad_h;\n      const int data_offset_h_ptr = (((2 * (i * kernel_w + j)) * height_col + h_out) * width_col + w_out);\n      const int data_offset_w_ptr = (((2 * (i * kernel_w + j) + 1) * height_col + h_out) * width_col + w_out);\n      const int data_mask_hw_ptr = (((i * kernel_w + j) * height_col + h_out) * width_col + w_out);\n      const float offset_h = data_offset_ptr[data_offset_h_ptr];\n      const float offset_w = data_offset_ptr[data_offset_w_ptr];\n      const float mask = data_mask_ptr[data_mask_hw_ptr];\n      float inv_h = h_in + i * dilation_h + offset_h;\n      float inv_w = w_in + j * dilation_w + offset_w;\n      if (inv_h <= -1 || inv_w <= -1 || inv_h >= height || inv_w >= width)\n      {\n        inv_h = inv_w = -2;\n      }\n      else\n      {\n        mval += 
data_col_ptr[col_pos] * dmcn_im2col_bilinear_cpu(data_im_ptr + cnt * height * width, width, height, width, inv_h, inv_w);\n      }\n      const float weight = dmcn_get_coordinate_weight_cpu(\n          inv_h, inv_w,\n          height, width, data_im_ptr + cnt * height * width, width, bp_dir);\n      val += weight * data_col_ptr[col_pos] * mask;\n      cnt += 1;\n    }\n    // KERNEL_ASSIGN(grad_offset[index], offset_req, val);\n    grad_offset[index] = val;\n    if (offset_c % 2 == 0)\n      // KERNEL_ASSIGN(grad_mask[(((b * deformable_group + deformable_group_index) * kernel_h * kernel_w + offset_c / 2) * height_col + h) * width_col + w], mask_req, mval);\n      grad_mask[(((b * deformable_group + deformable_group_index) * kernel_h * kernel_w + offset_c / 2) * height_col + h) * width_col + w] = mval;\n  }\n}\n\nvoid modulated_deformable_im2col_cpu(const float* data_im, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, const int dilation_w,\n  const int deformable_group, float* data_col) {\n  // num_axes should be smaller than block size\n  const int channel_per_deformable_group = channels / deformable_group;\n  const int num_kernels = channels * batch_size * height_col * width_col;\n  modulated_deformable_im2col_cpu_kernel(\n      num_kernels, data_im, data_offset, data_mask, height_im, width_im, kernel_h, kernel_w,\n      pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w, channel_per_deformable_group,\n      batch_size, channels, deformable_group, height_col, width_col, data_col);\n  \n  /*cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_im2col_cuda: %s\\n\", cudaGetErrorString(err));\n  }*/\n\n}\n\nvoid 
modulated_deformable_col2im_cpu(const float* data_col, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, const int dilation_w, \n  const int deformable_group, float* grad_im){\n\n  const int channel_per_deformable_group = channels / deformable_group;\n  const int num_kernels = channels * kernel_h * kernel_w * batch_size * height_col * width_col;\n  modulated_deformable_col2im_cpu_kernel(\n        num_kernels, data_col, data_offset, data_mask, channels, height_im, width_im,\n        kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w,\n        dilation_h, dilation_w, channel_per_deformable_group,\n        batch_size, deformable_group, height_col, width_col, grad_im);\n  /*cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_col2im_cuda: %s\\n\", cudaGetErrorString(err));\n  }*/\n\n}\n\nvoid modulated_deformable_col2im_coord_cpu(const float* data_col, const float* data_im, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, const int dilation_w, \n  const int deformable_group,\n  float* grad_offset, float* grad_mask) {\n  const int num_kernels = batch_size * height_col * width_col * 2 * kernel_h * kernel_w * deformable_group;\n  const int channel_per_deformable_group = channels * kernel_h * kernel_w / deformable_group;\n  modulated_deformable_col2im_coord_cpu_kernel(\n        num_kernels, data_col, data_im, data_offset, data_mask, channels, height_im, width_im,\n     
   kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w,\n        dilation_h, dilation_w, channel_per_deformable_group,\n        batch_size, 2 * kernel_h * kernel_w * deformable_group, deformable_group, height_col, width_col, \n        grad_offset, grad_mask);\n  /*cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_col2im_coord_cuda: %s\\n\", cudaGetErrorString(err));\n  }*/\n}"
  },
  {
    "path": "DCNv2/src/cpu/dcn_v2_im2col_cpu.h",
    "content": "\n/*!\n ******************* BEGIN Caffe Copyright Notice and Disclaimer ****************\n *\n * COPYRIGHT\n *\n * All contributions by the University of California:\n * Copyright (c) 2014-2017 The Regents of the University of California (Regents)\n * All rights reserved.\n *\n * All other contributions:\n * Copyright (c) 2014-2017, the respective contributors\n * All rights reserved.\n *\n * Caffe uses a shared copyright model: each contributor holds copyright over\n * their contributions to Caffe. The project versioning records all such\n * contribution and copyright details. If a contributor wants to further mark\n * their specific copyright on a particular contribution, they should indicate\n * their copyright solely in the commit message of the change when it is\n * committed.\n *\n * LICENSE\n *\n * Redistribution and use in source and binary forms, with or without\n * modification, are permitted provided that the following conditions are met:\n *\n * 1. Redistributions of source code must retain the above copyright notice, this\n * list of conditions and the following disclaimer.\n * 2. Redistributions in binary form must reproduce the above copyright notice,\n * this list of conditions and the following disclaimer in the documentation\n * and/or other materials provided with the distribution.\n *\n * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND\n * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\n * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\n * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR\n * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\n * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\n * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND\n * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\n * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\n * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n *\n * CONTRIBUTION AGREEMENT\n *\n * By contributing to the BVLC/caffe repository through pull-request, comment,\n * or otherwise, the contributor releases their content to the\n * license and copyright terms herein.\n *\n ***************** END Caffe Copyright Notice and Disclaimer ********************\n *\n * Copyright (c) 2018 Microsoft\n * Licensed under The MIT License [see LICENSE for details]\n * \\file modulated_deformable_im2col.h\n * \\brief Function definitions of converting an image to\n * column matrix based on kernel, padding, dilation, and offset.\n * These functions are mainly used in deformable convolution operators.\n * \\ref: https://arxiv.org/abs/1811.11168\n * \\author Yuwen Xiong, Haozhi Qi, Jifeng Dai, Xizhou Zhu, Han Hu\n */\n\n/***************** Adapted by Charles Shang *********************/\n// modified from the CUDA version for CPU use by Daniel K. 
Suhendro\n\n#ifndef DCN_V2_IM2COL_CPU\n#define DCN_V2_IM2COL_CPU\n\n#ifdef __cplusplus\nextern \"C\"\n{\n#endif\n\n  void modulated_deformable_im2col_cpu(const float *data_im, const float *data_offset, const float *data_mask,\n                                        const int batch_size, const int channels, const int height_im, const int width_im,\n                                        const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                        const int pad_h, const int pad_w, const int stride_h, const int stride_w,\n                                        const int dilation_h, const int dilation_w,\n                                        const int deformable_group, float *data_col);\n\n  void modulated_deformable_col2im_cpu(const float *data_col, const float *data_offset, const float *data_mask,\n                                        const int batch_size, const int channels, const int height_im, const int width_im,\n                                        const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                        const int pad_h, const int pad_w, const int stride_h, const int stride_w,\n                                        const int dilation_h, const int dilation_w,\n                                        const int deformable_group, float *grad_im);\n\n  void modulated_deformable_col2im_coord_cpu(const float *data_col, const float *data_im, const float *data_offset, const float *data_mask,\n                                         const int batch_size, const int channels, const int height_im, const int width_im,\n                                         const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                         const int pad_h, const int pad_w, const int stride_h, const int stride_w,\n                                         const int dilation_h, const int 
dilation_w,\n                                         const int deformable_group,\n                                         float *grad_offset, float *grad_mask);\n\n#ifdef __cplusplus\n}\n#endif\n\n#endif"
  },
  {
    "path": "DCNv2/src/cpu/dcn_v2_psroi_pooling_cpu.cpp",
    "content": "/*!\n * Copyright (c) 2017 Microsoft\n * Licensed under The MIT License [see LICENSE for details]\n * \\file deformable_psroi_pooling.cu\n * \\brief\n * \\author Yi Li, Guodong Zhang, Jifeng Dai\n*/\n/***************** Adapted by Charles Shang *********************/\n// modified from the CUDA version for CPU use by Daniel K. Suhendro\n\n#include <cstdio>\n#include <algorithm>\n#include <cstring>\n\n#include <ATen/ATen.h>\n//#include <ATen/cuda/CUDAContext.h>\n\n#include <TH/TH.h>\n//#include <THC/THCAtomics.cuh>\n//#include <THC/THCDeviceUtils.cuh>\n\n/*#define CUDA_KERNEL_LOOP(i, n)                        \\\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x; \\\n       i < (n);                                       \\\n       i += blockDim.x * gridDim.x)\n\nconst int CUDA_NUM_THREADS = 1024;\ninline int GET_BLOCKS(const int N)\n{\n  return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS;\n}*/\n\ntemplate <typename T>\nT bilinear_interp_cpu(\n    const T *data,\n    const T x,\n    const T y,\n    const int width,\n    const int height)\n{\n  int x1 = floor(x);\n  int x2 = ceil(x);\n  int y1 = floor(y);\n  int y2 = ceil(y);\n  T dist_x = static_cast<T>(x - x1);\n  T dist_y = static_cast<T>(y - y1);\n  T value11 = data[y1 * width + x1];\n  T value12 = data[y2 * width + x1];\n  T value21 = data[y1 * width + x2];\n  T value22 = data[y2 * width + x2];\n  T value = (1 - dist_x) * (1 - dist_y) * value11 +\n            (1 - dist_x) * dist_y * value12 +\n            dist_x * (1 - dist_y) * value21 +\n            dist_x * dist_y * value22;\n  return value;\n}\n\ntemplate <typename T>\n void DeformablePSROIPoolForwardKernelCpu(\n    const int count,\n    const T *bottom_data,\n    const T spatial_scale,\n    const int channels,\n    const int height, const int width,\n    const int pooled_height, const int pooled_width,\n    const T *bottom_rois, const T *bottom_trans,\n    const int no_trans,\n    const T trans_std,\n    const int sample_per_part,\n    
const int output_dim,\n    const int group_size,\n    const int part_size,\n    const int num_classes,\n    const int channels_each_class,\n    T *top_data,\n    T *top_count)\n{\n  for(int index = 0; index < count; index++)\n  {\n    // The output is in order (n, ctop, ph, pw)\n    int pw = index % pooled_width;\n    int ph = (index / pooled_width) % pooled_height;\n    int ctop = (index / pooled_width / pooled_height) % output_dim;\n    int n = index / pooled_width / pooled_height / output_dim;\n\n    // [start, end) interval for spatial sampling\n    const T *offset_bottom_rois = bottom_rois + n * 5;\n    int roi_batch_ind = offset_bottom_rois[0];\n    T roi_start_w = static_cast<T>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;\n    T roi_start_h = static_cast<T>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;\n    T roi_end_w = static_cast<T>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;\n    T roi_end_h = static_cast<T>(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;\n\n    // Force too small ROIs to be 1x1\n    T roi_width = std::max(roi_end_w - roi_start_w, T(0.1)); //avoid 0\n    T roi_height = std::max(roi_end_h - roi_start_h, T(0.1));\n\n    // Compute w and h at bottom\n    T bin_size_h = roi_height / static_cast<T>(pooled_height);\n    T bin_size_w = roi_width / static_cast<T>(pooled_width);\n\n    T sub_bin_size_h = bin_size_h / static_cast<T>(sample_per_part);\n    T sub_bin_size_w = bin_size_w / static_cast<T>(sample_per_part);\n\n    int part_h = floor(static_cast<T>(ph) / pooled_height * part_size);\n    int part_w = floor(static_cast<T>(pw) / pooled_width * part_size);\n    int class_id = ctop / channels_each_class;\n    T trans_x = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w] * trans_std;\n    T trans_y = no_trans ? 
static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w] * trans_std;\n\n    T wstart = static_cast<T>(pw) * bin_size_w + roi_start_w;\n    wstart += trans_x * roi_width;\n    T hstart = static_cast<T>(ph) * bin_size_h + roi_start_h;\n    hstart += trans_y * roi_height;\n\n    T sum = 0;\n    int count = 0;\n    int gw = floor(static_cast<T>(pw) * group_size / pooled_width);\n    int gh = floor(static_cast<T>(ph) * group_size / pooled_height);\n    gw = std::min(std::max(gw, 0), group_size - 1);\n    gh = std::min(std::max(gh, 0), group_size - 1);\n\n    const T *offset_bottom_data = bottom_data + (roi_batch_ind * channels) * height * width;\n    for (int ih = 0; ih < sample_per_part; ih++)\n    {\n      for (int iw = 0; iw < sample_per_part; iw++)\n      {\n        T w = wstart + iw * sub_bin_size_w;\n        T h = hstart + ih * sub_bin_size_h;\n        // bilinear interpolation\n        if (w < -0.5 || w > width - 0.5 || h < -0.5 || h > height - 0.5)\n        {\n          continue;\n        }\n        w = std::min(std::max(w, T(0.)), width - T(1.));\n        h = std::min(std::max(h, T(0.)), height - T(1.));\n        int c = (ctop * group_size + gh) * group_size + gw;\n        T val = bilinear_interp_cpu(offset_bottom_data + c * height * width, w, h, width, height);\n        sum += val;\n        count++;\n      }\n    }\n    top_data[index] = count == 0 ? 
static_cast<T>(0) : sum / count;\n    top_count[index] = count;\n  }\n}\n\ntemplate <typename T>\nvoid DeformablePSROIPoolBackwardAccKernelCpu(\n    const int count,\n    const T *top_diff,\n    const T *top_count,\n    const int num_rois,\n    const T spatial_scale,\n    const int channels,\n    const int height, const int width,\n    const int pooled_height, const int pooled_width,\n    const int output_dim,\n    T *bottom_data_diff, T *bottom_trans_diff,\n    const T *bottom_data,\n    const T *bottom_rois,\n    const T *bottom_trans,\n    const int no_trans,\n    const T trans_std,\n    const int sample_per_part,\n    const int group_size,\n    const int part_size,\n    const int num_classes,\n    const int channels_each_class)\n{\n  for(int index = 0; index < count; index++)\n  {\n    // The output is in order (n, ctop, ph, pw)\n    int pw = index % pooled_width;\n    int ph = (index / pooled_width) % pooled_height;\n    int ctop = (index / pooled_width / pooled_height) % output_dim;\n    int n = index / pooled_width / pooled_height / output_dim;\n\n    // [start, end) interval for spatial sampling\n    const T *offset_bottom_rois = bottom_rois + n * 5;\n    int roi_batch_ind = offset_bottom_rois[0];\n    T roi_start_w = static_cast<T>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;\n    T roi_start_h = static_cast<T>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;\n    T roi_end_w = static_cast<T>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;\n    T roi_end_h = static_cast<T>(round(offset_bottom_rois[4]) + 1.) 
* spatial_scale - 0.5;\n    \n    // Force too small ROIs to be 1x1\n    T roi_width = std::max(roi_end_w - roi_start_w, T(0.1)); //avoid 0\n    T roi_height = std::max(roi_end_h - roi_start_h, T(0.1));\n\n    // Compute w and h at bottom\n    T bin_size_h = roi_height / static_cast<T>(pooled_height);\n    T bin_size_w = roi_width / static_cast<T>(pooled_width);\n\n    T sub_bin_size_h = bin_size_h / static_cast<T>(sample_per_part);\n    T sub_bin_size_w = bin_size_w / static_cast<T>(sample_per_part);\n\n    int part_h = floor(static_cast<T>(ph) / pooled_height * part_size);\n    int part_w = floor(static_cast<T>(pw) / pooled_width * part_size);\n    int class_id = ctop / channels_each_class;\n    T trans_x = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w] * trans_std;\n    T trans_y = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w] * trans_std;\n\n    T wstart = static_cast<T>(pw) * bin_size_w + roi_start_w;\n    wstart += trans_x * roi_width;\n    T hstart = static_cast<T>(ph) * bin_size_h + roi_start_h;\n    hstart += trans_y * roi_height;\n\n    if (top_count[index] <= 0)\n    {\n      continue;\n    }\n    T diff_val = top_diff[index] / top_count[index];\n    const T *offset_bottom_data = bottom_data + roi_batch_ind * channels * height * width;\n    T *offset_bottom_data_diff = bottom_data_diff + roi_batch_ind * channels * height * width;\n    int gw = floor(static_cast<T>(pw) * group_size / pooled_width);\n    int gh = floor(static_cast<T>(ph) * group_size / pooled_height);\n    gw = std::min(std::max(gw, 0), group_size - 1);\n    gh = std::min(std::max(gh, 0), group_size - 1);\n\n    for (int ih = 0; ih < sample_per_part; ih++)\n    {\n      for (int iw = 0; iw < sample_per_part; iw++)\n      {\n        T w = wstart + iw * sub_bin_size_w;\n        T h = hstart + ih * sub_bin_size_h;\n        // bilinear 
interpolation\n        if (w < -0.5 || w > width - 0.5 || h < -0.5 || h > height - 0.5)\n        {\n          continue;\n        }\n        w = std::min(std::max(w, T(0.)), width - T(1.));\n        h = std::min(std::max(h, T(0.)), height - T(1.));\n        int c = (ctop * group_size + gh) * group_size + gw;\n        // backward on feature\n        int x0 = floor(w);\n        int x1 = ceil(w);\n        int y0 = floor(h);\n        int y1 = ceil(h);\n        T dist_x = w - x0, dist_y = h - y0;\n        T q00 = (1 - dist_x) * (1 - dist_y);\n        T q01 = (1 - dist_x) * dist_y;\n        T q10 = dist_x * (1 - dist_y);\n        T q11 = dist_x * dist_y;\n        int bottom_index_base = c * height * width;\n        /*atomicAdd(offset_bottom_data_diff + bottom_index_base + y0 * width + x0, q00 * diff_val);\n        atomicAdd(offset_bottom_data_diff + bottom_index_base + y1 * width + x0, q01 * diff_val);\n        atomicAdd(offset_bottom_data_diff + bottom_index_base + y0 * width + x1, q10 * diff_val);\n        atomicAdd(offset_bottom_data_diff + bottom_index_base + y1 * width + x1, q11 * diff_val);*/\n       *(offset_bottom_data_diff + bottom_index_base + y0 * width + x0) += q00 * diff_val;\n       *(offset_bottom_data_diff + bottom_index_base + y1 * width + x0) += q01 * diff_val;\n       *(offset_bottom_data_diff + bottom_index_base + y0 * width + x1) += q10 * diff_val;\n       *(offset_bottom_data_diff + bottom_index_base + y1 * width + x1) += q11 * diff_val;\n\n\n        if (no_trans)\n        {\n          continue;\n        }\n        T U00 = offset_bottom_data[bottom_index_base + y0 * width + x0];\n        T U01 = offset_bottom_data[bottom_index_base + y1 * width + x0];\n        T U10 = offset_bottom_data[bottom_index_base + y0 * width + x1];\n        T U11 = offset_bottom_data[bottom_index_base + y1 * width + x1];\n        T diff_x = (U11 * dist_y + U10 * (1 - dist_y) - U01 * dist_y - U00 * (1 - dist_y)) * trans_std * diff_val;\n        diff_x *= roi_width;\n        T 
diff_y = (U11 * dist_x + U01 * (1 - dist_x) - U10 * dist_x - U00 * (1 - dist_x)) * trans_std * diff_val;\n        diff_y *= roi_height;\n\n        /*atomicAdd(bottom_trans_diff + (((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w, diff_x);\n        atomicAdd(bottom_trans_diff + (((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w, diff_y);*/\n        *(bottom_trans_diff + (((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w) += diff_x;\n        *(bottom_trans_diff + (((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w) += diff_y;\n      }\n    }\n  }\n}\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_cpu_forward(const at::Tensor &input,\n                                  const at::Tensor &bbox,\n                                  const at::Tensor &trans,\n                                  const int no_trans,\n                                  const float spatial_scale,\n                                  const int output_dim,\n                                  const int group_size,\n                                  const int pooled_size,\n                                  const int part_size,\n                                  const int sample_per_part,\n                                  const float trans_std)\n{\n  /*AT_ASSERTM(input.type().is_cuda(), \"input must be a CUDA tensor\");\n  AT_ASSERTM(bbox.type().is_cuda(), \"rois must be a CUDA tensor\");\n  AT_ASSERTM(trans.type().is_cuda(), \"trans must be a CUDA tensor\");*/\n\n  const int batch = input.size(0);\n  const int channels = input.size(1);\n  const int height = input.size(2);\n  const int width = input.size(3);\n  const int channels_trans = no_trans ? 
2 : trans.size(1);\n  const int num_bbox = bbox.size(0);\n\n  AT_ASSERTM(channels == output_dim, \"input channels and output channels must equal\");\n  auto pooled_height = pooled_size;\n  auto pooled_width = pooled_size;\n\n  auto out = at::empty({num_bbox, output_dim, pooled_height, pooled_width}, input.options());\n  long out_size = num_bbox * output_dim * pooled_height * pooled_width;\n  auto top_count = at::zeros({num_bbox, output_dim, pooled_height, pooled_width}, input.options());\n\n  const int num_classes = no_trans ? 1 : channels_trans / 2;\n  const int channels_each_class = no_trans ? output_dim : output_dim / num_classes;\n\n  //cudaStream_t stream = at::cuda::getCurrentCUDAStream();\n\n  if (out.numel() == 0)\n  {\n    //THCudaCheck(cudaGetLastError());\n    return std::make_tuple(out, top_count);\n  }\n\n  /*dim3 grid(std::min(THCCeilDiv(out_size, 512L), 4096L));\n  dim3 block(512);*/\n\n  AT_DISPATCH_FLOATING_TYPES(input.type(), \"dcn_v2_psroi_pooling_cpu_forward\", [&] {\n    DeformablePSROIPoolForwardKernelCpu<scalar_t>(\n        out_size,\n        input.contiguous().data<scalar_t>(),\n        spatial_scale,\n        channels,\n        height, width,\n        pooled_height,\n        pooled_width,\n        bbox.contiguous().data<scalar_t>(),\n        trans.contiguous().data<scalar_t>(),\n        no_trans,\n        trans_std,\n        sample_per_part,\n        output_dim,\n        group_size,\n        part_size,\n        num_classes,\n        channels_each_class,\n        out.data<scalar_t>(),\n        top_count.data<scalar_t>());\n  });\n  //THCudaCheck(cudaGetLastError());\n  return std::make_tuple(out, top_count);\n}\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_cpu_backward(const at::Tensor &out_grad,\n                                   const at::Tensor &input,\n                                   const at::Tensor &bbox,\n                                   const at::Tensor &trans,\n                                   const at::Tensor 
&top_count,\n                                   const int no_trans,\n                                   const float spatial_scale,\n                                   const int output_dim,\n                                   const int group_size,\n                                   const int pooled_size,\n                                   const int part_size,\n                                   const int sample_per_part,\n                                   const float trans_std)\n{\n  /*AT_ASSERTM(out_grad.type().is_cuda(), \"out_grad must be a CUDA tensor\");\n  AT_ASSERTM(input.type().is_cuda(), \"input must be a CUDA tensor\");\n  AT_ASSERTM(bbox.type().is_cuda(), \"bbox must be a CUDA tensor\");\n  AT_ASSERTM(trans.type().is_cuda(), \"trans must be a CUDA tensor\");\n  AT_ASSERTM(top_count.type().is_cuda(), \"top_count must be a CUDA tensor\");*/\n\n  const int batch = input.size(0);\n  const int channels = input.size(1);\n  const int height = input.size(2);\n  const int width = input.size(3);\n  const int channels_trans = no_trans ? 2 : trans.size(1);\n  const int num_bbox = bbox.size(0);\n\n  AT_ASSERTM(channels == output_dim, \"input channels and output channels must equal\");\n  auto pooled_height = pooled_size;\n  auto pooled_width = pooled_size;\n  long out_size = num_bbox * output_dim * pooled_height * pooled_width;\n  const int num_classes = no_trans ? 1 : channels_trans / 2;\n  const int channels_each_class = no_trans ? 
output_dim : output_dim / num_classes;\n\n  auto input_grad = at::zeros({batch, channels, height, width}, out_grad.options());\n  auto trans_grad = at::zeros_like(trans);\n\n  if (input_grad.numel() == 0)\n  {\n    //THCudaCheck(cudaGetLastError());\n    return std::make_tuple(input_grad, trans_grad);\n  }\n\n  /*dim3 grid(std::min(THCCeilDiv(out_size, 512L), 4096L));\n  dim3 block(512);\n  cudaStream_t stream = at::cuda::getCurrentCUDAStream();*/\n\n  AT_DISPATCH_FLOATING_TYPES(out_grad.type(), \"dcn_v2_psroi_pooling_cpu_backward\", [&] {\n    DeformablePSROIPoolBackwardAccKernelCpu<scalar_t>(\n        out_size,\n        out_grad.contiguous().data<scalar_t>(),\n        top_count.contiguous().data<scalar_t>(),\n        num_bbox,\n        spatial_scale,\n        channels,\n        height,\n        width,\n        pooled_height,\n        pooled_width,\n        output_dim,\n        input_grad.contiguous().data<scalar_t>(),\n        trans_grad.contiguous().data<scalar_t>(),\n        input.contiguous().data<scalar_t>(),\n        bbox.contiguous().data<scalar_t>(),\n        trans.contiguous().data<scalar_t>(),\n        no_trans,\n        trans_std,\n        sample_per_part,\n        group_size,\n        part_size,\n        num_classes,\n        channels_each_class);\n  });\n  //THCudaCheck(cudaGetLastError());\n  return std::make_tuple(input_grad, trans_grad);\n}"
  },
  {
    "path": "DCNv2/src/cpu/vision.h",
    "content": "#pragma once\n#include <torch/extension.h>\n\nat::Tensor\ndcn_v2_cpu_forward(const at::Tensor &input,\n                    const at::Tensor &weight,\n                    const at::Tensor &bias,\n                    const at::Tensor &offset,\n                    const at::Tensor &mask,\n                    const int kernel_h,\n                    const int kernel_w,\n                    const int stride_h,\n                    const int stride_w,\n                    const int pad_h,\n                    const int pad_w,\n                    const int dilation_h,\n                    const int dilation_w,\n                    const int deformable_group);\n\nstd::vector<at::Tensor>\ndcn_v2_cpu_backward(const at::Tensor &input,\n                     const at::Tensor &weight,\n                     const at::Tensor &bias,\n                     const at::Tensor &offset,\n                     const at::Tensor &mask,\n                     const at::Tensor &grad_output,\n                     int kernel_h, int kernel_w,\n                     int stride_h, int stride_w,\n                     int pad_h, int pad_w,\n                     int dilation_h, int dilation_w,\n                     int deformable_group);\n\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_cpu_forward(const at::Tensor &input,\n                                  const at::Tensor &bbox,\n                                  const at::Tensor &trans,\n                                  const int no_trans,\n                                  const float spatial_scale,\n                                  const int output_dim,\n                                  const int group_size,\n                                  const int pooled_size,\n                                  const int part_size,\n                                  const int sample_per_part,\n                                  const float trans_std);\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_cpu_backward(const 
at::Tensor &out_grad,\n                                   const at::Tensor &input,\n                                   const at::Tensor &bbox,\n                                   const at::Tensor &trans,\n                                   const at::Tensor &top_count,\n                                   const int no_trans,\n                                   const float spatial_scale,\n                                   const int output_dim,\n                                   const int group_size,\n                                   const int pooled_size,\n                                   const int part_size,\n                                   const int sample_per_part,\n                                   const float trans_std);"
  },
  {
    "path": "DCNv2/src/cuda/dcn_v2_cuda.cu",
    "content": "#include <vector>\n#include \"cuda/dcn_v2_im2col_cuda.h\"\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n#include <THC/THC.h>\n#include <THC/THCAtomics.cuh>\n#include <THC/THCDeviceUtils.cuh>\n\nTHCState *state = at::globalContext().lazyInitCUDA();\n\n// author: Charles Shang\n// https://github.com/torch/cunn/blob/master/lib/THCUNN/generic/SpatialConvolutionMM.cu\n\n// [batch gemm]\n// https://github.com/pytorch/pytorch/blob/master/aten/src/THC/generic/THCTensorMathBlas.cu\n\n__global__ void createBatchGemmBuffer(const float **input_b, float **output_b,\n                                      float **columns_b, const float **ones_b,\n                                      const float **weight_b, const float **bias_b,\n                                      float *input, float *output,\n                                      float *columns, float *ones,\n                                      float *weight, float *bias,\n                                      const int input_stride, const int output_stride,\n                                      const int columns_stride, const int ones_stride,\n                                      const int num_batches)\n{\n    const int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx < num_batches)\n    {\n        input_b[idx] = input + idx * input_stride;\n        output_b[idx] = output + idx * output_stride;\n        columns_b[idx] = columns + idx * columns_stride;\n        ones_b[idx] = ones + idx * ones_stride;\n        // share weights and bias within a Mini-Batch\n        weight_b[idx] = weight;\n        bias_b[idx] = bias;\n    }\n}\n\nat::Tensor\ndcn_v2_cuda_forward(const at::Tensor &input,\n                    const at::Tensor &weight,\n                    const at::Tensor &bias,\n                    const at::Tensor &offset,\n                    const at::Tensor &mask,\n                    const int kernel_h,\n                    const int kernel_w,\n                    const int 
stride_h,\n                    const int stride_w,\n                    const int pad_h,\n                    const int pad_w,\n                    const int dilation_h,\n                    const int dilation_w,\n                    const int deformable_group)\n{\n    using scalar_t = float;\n    // THCAssertSameGPU(THCudaTensor_checkGPU(state, 5, input, weight, bias, offset, mask));\n    AT_ASSERTM(input.type().is_cuda(), \"input must be a CUDA tensor\");\n    AT_ASSERTM(weight.type().is_cuda(), \"weight must be a CUDA tensor\");\n    AT_ASSERTM(bias.type().is_cuda(), \"bias must be a CUDA tensor\");\n    AT_ASSERTM(offset.type().is_cuda(), \"offset must be a CUDA tensor\");\n    AT_ASSERTM(mask.type().is_cuda(), \"mask must be a CUDA tensor\");\n\n    const int batch = input.size(0);\n    const int channels = input.size(1);\n    const int height = input.size(2);\n    const int width = input.size(3);\n\n    const int channels_out = weight.size(0);\n    const int channels_kernel = weight.size(1);\n    const int kernel_h_ = weight.size(2);\n    const int kernel_w_ = weight.size(3);\n\n    // printf(\"Kernels: %d %d %d %d\\n\", kernel_h_, kernel_w_, kernel_w, kernel_h);\n    // printf(\"Channels: %d %d\\n\", channels, channels_kernel);\n    // printf(\"Channels: %d %d\\n\", channels_out, channels_kernel);\n\n    // report expected (kernel_h x kernel_w) vs actual (kernel_h_ x kernel_w_)\n    AT_ASSERTM(kernel_h_ == kernel_h && kernel_w_ == kernel_w,\n               \"Input shape and kernel shape won't match: (%d x %d vs %d x %d).\", kernel_h, kernel_w, kernel_h_, kernel_w_);\n\n    AT_ASSERTM(channels == channels_kernel,\n               \"Input shape and kernel channels won't match: (%d vs %d).\", channels, channels_kernel);\n\n    const int height_out = (height + 2 * pad_h - (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;\n    const int width_out = (width + 2 * pad_w - (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;\n\n    auto ones = at::ones({batch, height_out, width_out}, input.options());\n    auto columns = at::empty({batch, channels 
* kernel_h * kernel_w, 1 * height_out * width_out}, input.options());\n    auto output = at::empty({batch, channels_out, height_out, width_out}, input.options());\n\n    // prepare for batch-wise computing, which is significantly faster than instance-wise computing\n    // when batch size is large.\n    // launch batch threads\n    int matrices_size = batch * sizeof(float *);\n    auto input_b = static_cast<const float **>(THCudaMalloc(state, matrices_size));\n    auto output_b = static_cast<float **>(THCudaMalloc(state, matrices_size));\n    auto columns_b = static_cast<float **>(THCudaMalloc(state, matrices_size));\n    auto ones_b = static_cast<const float **>(THCudaMalloc(state, matrices_size));\n    auto weight_b = static_cast<const float **>(THCudaMalloc(state, matrices_size));\n    auto bias_b = static_cast<const float **>(THCudaMalloc(state, matrices_size));\n\n    const int block = 128;\n    const int grid = (batch + block - 1) / block;\n\n    createBatchGemmBuffer<<<grid, block, 0, c10::cuda::getCurrentCUDAStream()>>>(\n        input_b, output_b,\n        columns_b, ones_b,\n        weight_b, bias_b,\n        input.data<scalar_t>(),\n        output.data<scalar_t>(),\n        columns.data<scalar_t>(),\n        ones.data<scalar_t>(),\n        weight.data<scalar_t>(),\n        bias.data<scalar_t>(),\n        channels * width * height,\n        channels_out * width_out * height_out,\n        channels * kernel_h * kernel_w * height_out * width_out,\n        height_out * width_out,\n        batch);\n\n    long m_ = channels_out;\n    long n_ = height_out * width_out;\n    long k_ = 1;\n    THCudaBlas_SgemmBatched(state,\n                            't',\n                            'n',\n                            n_,\n                            m_,\n                            k_,\n                            1.0f,\n                            ones_b, k_,\n                            bias_b, k_,\n                            0.0f,\n                            
output_b, n_,\n                            batch);\n\n    modulated_deformable_im2col_cuda(c10::cuda::getCurrentCUDAStream(),\n                                     input.data<scalar_t>(),\n                                     offset.data<scalar_t>(),\n                                     mask.data<scalar_t>(),\n                                     batch, channels, height, width,\n                                     height_out, width_out, kernel_h, kernel_w,\n                                     pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w,\n                                     deformable_group,\n                                     columns.data<scalar_t>());\n\n    long m = channels_out;\n    long n = height_out * width_out;\n    long k = channels * kernel_h * kernel_w;\n    THCudaBlas_SgemmBatched(state,\n                            'n',\n                            'n',\n                            n,\n                            m,\n                            k,\n                            1.0f,\n                            (const float **)columns_b, n,\n                            weight_b, k,\n                            1.0f,\n                            output_b, n,\n                            batch);\n\n    THCudaFree(state, input_b);\n    THCudaFree(state, output_b);\n    THCudaFree(state, columns_b);\n    THCudaFree(state, ones_b);\n    THCudaFree(state, weight_b);\n    THCudaFree(state, bias_b);\n    return output;\n}\n\n__global__ void createBatchGemmBufferBackward(\n    float **grad_output_b,\n    float **columns_b,\n    float **ones_b,\n    float **weight_b,\n    float **grad_weight_b,\n    float **grad_bias_b,\n    float *grad_output,\n    float *columns,\n    float *ones,\n    float *weight,\n    float *grad_weight,\n    float *grad_bias,\n    const int grad_output_stride,\n    const int columns_stride,\n    const int ones_stride,\n    const int num_batches)\n{\n    const int idx = blockIdx.x * blockDim.x + threadIdx.x;\n    if (idx 
< num_batches)\n    {\n        grad_output_b[idx] = grad_output + idx * grad_output_stride;\n        columns_b[idx] = columns + idx * columns_stride;\n        ones_b[idx] = ones + idx * ones_stride;\n\n        // share weights and bias within a Mini-Batch\n        weight_b[idx] = weight;\n        grad_weight_b[idx] = grad_weight;\n        grad_bias_b[idx] = grad_bias;\n    }\n}\n\nstd::vector<at::Tensor> dcn_v2_cuda_backward(const at::Tensor &input,\n                                             const at::Tensor &weight,\n                                             const at::Tensor &bias,\n                                             const at::Tensor &offset,\n                                             const at::Tensor &mask,\n                                             const at::Tensor &grad_output,\n                                             int kernel_h, int kernel_w,\n                                             int stride_h, int stride_w,\n                                             int pad_h, int pad_w,\n                                             int dilation_h, int dilation_w,\n                                             int deformable_group)\n{\n\n    THArgCheck(input.is_contiguous(), 1, \"input tensor has to be contiguous\");\n    THArgCheck(weight.is_contiguous(), 2, \"weight tensor has to be contiguous\");\n\n    AT_ASSERTM(input.type().is_cuda(), \"input must be a CUDA tensor\");\n    AT_ASSERTM(weight.type().is_cuda(), \"weight must be a CUDA tensor\");\n    AT_ASSERTM(bias.type().is_cuda(), \"bias must be a CUDA tensor\");\n    AT_ASSERTM(offset.type().is_cuda(), \"offset must be a CUDA tensor\");\n    AT_ASSERTM(mask.type().is_cuda(), \"mask must be a CUDA tensor\");\n\n    const int batch = input.size(0);\n    const int channels = input.size(1);\n    const int height = input.size(2);\n    const int width = input.size(3);\n\n    const int channels_out = weight.size(0);\n    const int channels_kernel = weight.size(1);\n    const int kernel_h_ 
= weight.size(2);\n    const int kernel_w_ = weight.size(3);\n\n    // report expected (kernel_h x kernel_w) vs actual (kernel_h_ x kernel_w_)\n    AT_ASSERTM(kernel_h_ == kernel_h && kernel_w_ == kernel_w,\n               \"Input shape and kernel shape won't match: (%d x %d vs %d x %d).\", kernel_h, kernel_w, kernel_h_, kernel_w_);\n\n    AT_ASSERTM(channels == channels_kernel,\n               \"Input shape and kernel channels won't match: (%d vs %d).\", channels, channels_kernel);\n\n    const int height_out = (height + 2 * pad_h - (dilation_h * (kernel_h - 1) + 1)) / stride_h + 1;\n    const int width_out = (width + 2 * pad_w - (dilation_w * (kernel_w - 1) + 1)) / stride_w + 1;\n\n    auto ones = at::ones({height_out, width_out}, input.options());\n    auto columns = at::empty({channels * kernel_h * kernel_w, 1 * height_out * width_out}, input.options());\n    auto output = at::empty({batch, channels_out, height_out, width_out}, input.options());\n\n    auto grad_input = at::zeros_like(input);\n    auto grad_weight = at::zeros_like(weight);\n    auto grad_bias = at::zeros_like(bias);\n    auto grad_offset = at::zeros_like(offset);\n    auto grad_mask = at::zeros_like(mask);\n\n    using scalar_t = float;\n\n    for (int b = 0; b < batch; b++)\n    {\n        auto input_n = input.select(0, b);\n        auto offset_n = offset.select(0, b);\n        auto mask_n = mask.select(0, b);\n        auto grad_output_n = grad_output.select(0, b);\n        auto grad_input_n = grad_input.select(0, b);\n        auto grad_offset_n = grad_offset.select(0, b);\n        auto grad_mask_n = grad_mask.select(0, b);\n\n        long m = channels * kernel_h * kernel_w;\n        long n = height_out * width_out;\n        long k = channels_out;\n\n        THCudaBlas_Sgemm(state, 'n', 't', n, m, k, 1.0f,\n                         grad_output_n.data<scalar_t>(), n,\n                         weight.data<scalar_t>(), m, 0.0f,\n                         columns.data<scalar_t>(), n);\n\n        // gradient w.r.t. 
input coordinate data\n        modulated_deformable_col2im_coord_cuda(c10::cuda::getCurrentCUDAStream(),\n                                               columns.data<scalar_t>(),\n                                               input_n.data<scalar_t>(),\n                                               offset_n.data<scalar_t>(),\n                                               mask_n.data<scalar_t>(),\n                                               1, channels, height, width,\n                                               height_out, width_out, kernel_h, kernel_w,\n                                               pad_h, pad_w, stride_h, stride_w,\n                                               dilation_h, dilation_w, deformable_group,\n                                               grad_offset_n.data<scalar_t>(),\n                                               grad_mask_n.data<scalar_t>());\n        // gradient w.r.t. input data\n        modulated_deformable_col2im_cuda(c10::cuda::getCurrentCUDAStream(),\n                                         columns.data<scalar_t>(),\n                                         offset_n.data<scalar_t>(),\n                                         mask_n.data<scalar_t>(),\n                                         1, channels, height, width,\n                                         height_out, width_out, kernel_h, kernel_w,\n                                         pad_h, pad_w, stride_h, stride_w,\n                                         dilation_h, dilation_w, deformable_group,\n                                         grad_input_n.data<scalar_t>());\n\n        // gradient w.r.t. 
weight, dWeight should accumulate across the batch and group\n        modulated_deformable_im2col_cuda(c10::cuda::getCurrentCUDAStream(),\n                                         input_n.data<scalar_t>(),\n                                         offset_n.data<scalar_t>(),\n                                         mask_n.data<scalar_t>(),\n                                         1, channels, height, width,\n                                         height_out, width_out, kernel_h, kernel_w,\n                                         pad_h, pad_w, stride_h, stride_w,\n                                         dilation_h, dilation_w, deformable_group,\n                                         columns.data<scalar_t>());\n\n        long m_ = channels_out;\n        long n_ = channels * kernel_h * kernel_w;\n        long k_ = height_out * width_out;\n\n        THCudaBlas_Sgemm(state, 't', 'n', n_, m_, k_, 1.0f,\n                         columns.data<scalar_t>(), k_,\n                         grad_output_n.data<scalar_t>(), k_, 1.0f,\n                         grad_weight.data<scalar_t>(), n_);\n\n        // gradient w.r.t. bias\n        // long m_ = channels_out;\n        // long k__ = height_out * width_out;\n        // THCudaBlas_Sgemm(state,\n        //                  't', 'n',\n        //                  k_, m_, 1, 1.0f,\n        //                  grad_output_n.data<scalar_t>(), k_,\n        //                  ones.data<scalar_t>(), 1, 1.0f,\n        //                  grad_bias.data<scalar_t>(), 1);\n        THCudaBlas_Sgemm(state,\n            'N', 'N', 1, m_, k_, 1.0f,\n            ones.data<scalar_t>(), 1,\n            grad_output_n.data<scalar_t>(), k_,\n            1.0f,\n            grad_bias.data<scalar_t>(), 1);\n    }\n\n    return {\n        grad_input, grad_offset, grad_mask, grad_weight, grad_bias\n    };\n}\n"
  },
  {
    "path": "DCNv2/src/cuda/dcn_v2_im2col_cuda.cu",
    "content": "#include \"dcn_v2_im2col_cuda.h\"\n#include <cstdio>\n#include <algorithm>\n#include <cstring>\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n#include <THC/THC.h>\n#include <THC/THCAtomics.cuh>\n#include <THC/THCDeviceUtils.cuh>\n\n#define CUDA_KERNEL_LOOP(i, n)                          \\\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x;   \\\n      i < (n);                                          \\\n      i += blockDim.x * gridDim.x)\n\nconst int CUDA_NUM_THREADS = 1024;\ninline int GET_BLOCKS(const int N)\n{\n  return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS;\n}\n\n\n__device__ float dmcn_im2col_bilinear_cuda(const float *bottom_data, const int data_width,\n                                      const int height, const int width, float h, float w)\n{\n  int h_low = floor(h);\n  int w_low = floor(w);\n  int h_high = h_low + 1;\n  int w_high = w_low + 1;\n\n  float lh = h - h_low;\n  float lw = w - w_low;\n  float hh = 1 - lh, hw = 1 - lw;\n\n  float v1 = 0;\n  if (h_low >= 0 && w_low >= 0)\n    v1 = bottom_data[h_low * data_width + w_low];\n  float v2 = 0;\n  if (h_low >= 0 && w_high <= width - 1)\n    v2 = bottom_data[h_low * data_width + w_high];\n  float v3 = 0;\n  if (h_high <= height - 1 && w_low >= 0)\n    v3 = bottom_data[h_high * data_width + w_low];\n  float v4 = 0;\n  if (h_high <= height - 1 && w_high <= width - 1)\n    v4 = bottom_data[h_high * data_width + w_high];\n\n  float w1 = hh * hw, w2 = hh * lw, w3 = lh * hw, w4 = lh * lw;\n\n  float val = (w1 * v1 + w2 * v2 + w3 * v3 + w4 * v4);\n  return val;\n}\n\n__device__ float dmcn_get_gradient_weight_cuda(float argmax_h, float argmax_w,\n                                          const int h, const int w, const int height, const int width)\n{\n  if (argmax_h <= -1 || argmax_h >= height || argmax_w <= -1 || argmax_w >= width)\n  {\n    //empty\n    return 0;\n  }\n\n  int argmax_h_low = floor(argmax_h);\n  int argmax_w_low = floor(argmax_w);\n  int 
argmax_h_high = argmax_h_low + 1;\n  int argmax_w_high = argmax_w_low + 1;\n\n  float weight = 0;\n  if (h == argmax_h_low && w == argmax_w_low)\n    weight = (h + 1 - argmax_h) * (w + 1 - argmax_w);\n  if (h == argmax_h_low && w == argmax_w_high)\n    weight = (h + 1 - argmax_h) * (argmax_w + 1 - w);\n  if (h == argmax_h_high && w == argmax_w_low)\n    weight = (argmax_h + 1 - h) * (w + 1 - argmax_w);\n  if (h == argmax_h_high && w == argmax_w_high)\n    weight = (argmax_h + 1 - h) * (argmax_w + 1 - w);\n  return weight;\n}\n\n__device__ float dmcn_get_coordinate_weight_cuda(float argmax_h, float argmax_w,\n                                            const int height, const int width, const float *im_data,\n                                            const int data_width, const int bp_dir)\n{\n  if (argmax_h <= -1 || argmax_h >= height || argmax_w <= -1 || argmax_w >= width)\n  {\n    //empty\n    return 0;\n  }\n\n  int argmax_h_low = floor(argmax_h);\n  int argmax_w_low = floor(argmax_w);\n  int argmax_h_high = argmax_h_low + 1;\n  int argmax_w_high = argmax_w_low + 1;\n\n  float weight = 0;\n\n  if (bp_dir == 0)\n  {\n    if (argmax_h_low >= 0 && argmax_w_low >= 0)\n      weight += -1 * (argmax_w_low + 1 - argmax_w) * im_data[argmax_h_low * data_width + argmax_w_low];\n    if (argmax_h_low >= 0 && argmax_w_high <= width - 1)\n      weight += -1 * (argmax_w - argmax_w_low) * im_data[argmax_h_low * data_width + argmax_w_high];\n    if (argmax_h_high <= height - 1 && argmax_w_low >= 0)\n      weight += (argmax_w_low + 1 - argmax_w) * im_data[argmax_h_high * data_width + argmax_w_low];\n    if (argmax_h_high <= height - 1 && argmax_w_high <= width - 1)\n      weight += (argmax_w - argmax_w_low) * im_data[argmax_h_high * data_width + argmax_w_high];\n  }\n  else if (bp_dir == 1)\n  {\n    if (argmax_h_low >= 0 && argmax_w_low >= 0)\n      weight += -1 * (argmax_h_low + 1 - argmax_h) * im_data[argmax_h_low * data_width + argmax_w_low];\n    if (argmax_h_low >= 0 && 
argmax_w_high <= width - 1)\n      weight += (argmax_h_low + 1 - argmax_h) * im_data[argmax_h_low * data_width + argmax_w_high];\n    if (argmax_h_high <= height - 1 && argmax_w_low >= 0)\n      weight += -1 * (argmax_h - argmax_h_low) * im_data[argmax_h_high * data_width + argmax_w_low];\n    if (argmax_h_high <= height - 1 && argmax_w_high <= width - 1)\n      weight += (argmax_h - argmax_h_low) * im_data[argmax_h_high * data_width + argmax_w_high];\n  }\n\n  return weight;\n}\n\n__global__ void modulated_deformable_im2col_gpu_kernel(const int n,\n                                                       const float *data_im, const float *data_offset, const float *data_mask,\n                                                       const int height, const int width, const int kernel_h, const int kernel_w,\n                                                       const int pad_h, const int pad_w,\n                                                       const int stride_h, const int stride_w,\n                                                       const int dilation_h, const int dilation_w,\n                                                       const int channel_per_deformable_group,\n                                                       const int batch_size, const int num_channels, const int deformable_group,\n                                                       const int height_col, const int width_col,\n                                                       float *data_col)\n{\n  // launch channels * batch_size * height_col * width_col cores\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    // NOTE(CharlesShang): different from Dai Jifeng's MXNet implementation, col_buffer is of shape (c*kw*kh, N, oh, ow)\n    // here columns is of shape (N, c*kw*kh, oh * ow), need to adapt axis\n\n    // index index of output matrix\n    const int w_col = index % width_col;\n    const int h_col = (index / width_col) % height_col;\n    // const int b_col = (index / width_col / height_col) % 
batch_size;\n    const int b_col = (index / width_col / height_col / num_channels) % batch_size;\n    // const int c_im = (index / width_col / height_col) / batch_size;\n    const int c_im = (index / width_col / height_col) % num_channels;\n    // const int c_col = c_im * kernel_h * kernel_w;\n    const int c_col = c_im * kernel_h * kernel_w;\n\n    // compute deformable group index\n    const int deformable_group_index = c_im / channel_per_deformable_group;\n\n    const int h_in = h_col * stride_h - pad_h;\n    const int w_in = w_col * stride_w - pad_w;\n\n    //  float *data_col_ptr = data_col + ((c_col * batch_size + b_col) * height_col + h_col) * width_col + w_col;\n    float *data_col_ptr = data_col + ((b_col * num_channels * kernel_w * kernel_h + c_col) * height_col + h_col) * width_col + w_col;\n    //const float* data_im_ptr = data_im + ((b_col * num_channels + c_im) * height + h_in) * width + w_in;\n    const float *data_im_ptr = data_im + (b_col * num_channels + c_im) * height * width;\n    const float *data_offset_ptr = data_offset + (b_col * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n\n    const float *data_mask_ptr = data_mask + (b_col * deformable_group + deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n\n    for (int i = 0; i < kernel_h; ++i)\n    {\n      for (int j = 0; j < kernel_w; ++j)\n      {\n        const int data_offset_h_ptr = ((2 * (i * kernel_w + j)) * height_col + h_col) * width_col + w_col;\n        const int data_offset_w_ptr = ((2 * (i * kernel_w + j) + 1) * height_col + h_col) * width_col + w_col;\n        const int data_mask_hw_ptr = ((i * kernel_w + j) * height_col + h_col) * width_col + w_col;\n        const float offset_h = data_offset_ptr[data_offset_h_ptr];\n        const float offset_w = data_offset_ptr[data_offset_w_ptr];\n        const float mask = data_mask_ptr[data_mask_hw_ptr];\n        float val = static_cast<float>(0);\n        const float 
h_im = h_in + i * dilation_h + offset_h;\n        const float w_im = w_in + j * dilation_w + offset_w;\n        //if (h_im >= 0 && w_im >= 0 && h_im < height && w_im < width) {\n        if (h_im > -1 && w_im > -1 && h_im < height && w_im < width)\n        {\n          //const float map_h = i * dilation_h + offset_h;\n          //const float map_w = j * dilation_w + offset_w;\n          //const int cur_height = height - h_in;\n          //const int cur_width = width - w_in;\n          //val = dmcn_im2col_bilinear_cuda(data_im_ptr, width, cur_height, cur_width, map_h, map_w);\n          val = dmcn_im2col_bilinear_cuda(data_im_ptr, width, height, width, h_im, w_im);\n        }\n        *data_col_ptr = val * mask;\n        // data_col_ptr += batch_size * height_col * width_col;\n        data_col_ptr += height_col * width_col;\n      }\n    }\n  }\n}\n\n__global__ void modulated_deformable_col2im_gpu_kernel(const int n,\n                                                       const float *data_col, const float *data_offset, const float *data_mask,\n                                                       const int channels, const int height, const int width,\n                                                       const int kernel_h, const int kernel_w,\n                                                       const int pad_h, const int pad_w,\n                                                       const int stride_h, const int stride_w,\n                                                       const int dilation_h, const int dilation_w,\n                                                       const int channel_per_deformable_group,\n                                                       const int batch_size, const int deformable_group,\n                                                       const int height_col, const int width_col,\n                                                       float *grad_im)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    const int j = (index / width_col 
/ height_col / batch_size) % kernel_w;\n    const int i = (index / width_col / height_col / batch_size / kernel_w) % kernel_h;\n    const int c = index / width_col / height_col / batch_size / kernel_w / kernel_h;\n    // compute the start and end of the output\n\n    const int deformable_group_index = c / channel_per_deformable_group;\n\n    int w_out = index % width_col;\n    int h_out = (index / width_col) % height_col;\n    int b = (index / width_col / height_col) % batch_size;\n    int w_in = w_out * stride_w - pad_w;\n    int h_in = h_out * stride_h - pad_h;\n\n    const float *data_offset_ptr = data_offset + (b * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n    const float *data_mask_ptr = data_mask + (b * deformable_group + deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n    const int data_offset_h_ptr = ((2 * (i * kernel_w + j)) * height_col + h_out) * width_col + w_out;\n    const int data_offset_w_ptr = ((2 * (i * kernel_w + j) + 1) * height_col + h_out) * width_col + w_out;\n    const int data_mask_hw_ptr = ((i * kernel_w + j) * height_col + h_out) * width_col + w_out;\n    const float offset_h = data_offset_ptr[data_offset_h_ptr];\n    const float offset_w = data_offset_ptr[data_offset_w_ptr];\n    const float mask = data_mask_ptr[data_mask_hw_ptr];\n    const float cur_inv_h_data = h_in + i * dilation_h + offset_h;\n    const float cur_inv_w_data = w_in + j * dilation_w + offset_w;\n\n    const float cur_top_grad = data_col[index] * mask;\n    const int cur_h = (int)cur_inv_h_data;\n    const int cur_w = (int)cur_inv_w_data;\n    for (int dy = -2; dy <= 2; dy++)\n    {\n      for (int dx = -2; dx <= 2; dx++)\n      {\n        if (cur_h + dy >= 0 && cur_h + dy < height &&\n            cur_w + dx >= 0 && cur_w + dx < width &&\n            abs(cur_inv_h_data - (cur_h + dy)) < 1 &&\n            abs(cur_inv_w_data - (cur_w + dx)) < 1)\n        {\n          int cur_bottom_grad_pos = 
((b * channels + c) * height + cur_h + dy) * width + cur_w + dx;\n          float weight = dmcn_get_gradient_weight_cuda(cur_inv_h_data, cur_inv_w_data, cur_h + dy, cur_w + dx, height, width);\n          atomicAdd(grad_im + cur_bottom_grad_pos, weight * cur_top_grad);\n        }\n      }\n    }\n  }\n}\n\n__global__ void modulated_deformable_col2im_coord_gpu_kernel(const int n,\n                                                             const float *data_col, const float *data_im,\n                                                             const float *data_offset, const float *data_mask,\n                                                             const int channels, const int height, const int width,\n                                                             const int kernel_h, const int kernel_w,\n                                                             const int pad_h, const int pad_w,\n                                                             const int stride_h, const int stride_w,\n                                                             const int dilation_h, const int dilation_w,\n                                                             const int channel_per_deformable_group,\n                                                             const int batch_size, const int offset_channels, const int deformable_group,\n                                                             const int height_col, const int width_col,\n                                                             float *grad_offset, float *grad_mask)\n{\n  CUDA_KERNEL_LOOP(index, n)\n  {\n    float val = 0, mval = 0;\n    int w = index % width_col;\n    int h = (index / width_col) % height_col;\n    int c = (index / width_col / height_col) % offset_channels;\n    int b = (index / width_col / height_col) / offset_channels;\n    // compute the start and end of the output\n\n    const int deformable_group_index = c / (2 * kernel_h * kernel_w);\n    const int col_step = 
kernel_h * kernel_w;\n    int cnt = 0;\n    const float *data_col_ptr = data_col + deformable_group_index * channel_per_deformable_group * batch_size * width_col * height_col;\n    const float *data_im_ptr = data_im + (b * deformable_group + deformable_group_index) * channel_per_deformable_group / kernel_h / kernel_w * height * width;\n    const float *data_offset_ptr = data_offset + (b * deformable_group + deformable_group_index) * 2 * kernel_h * kernel_w * height_col * width_col;\n    const float *data_mask_ptr = data_mask + (b * deformable_group + deformable_group_index) * kernel_h * kernel_w * height_col * width_col;\n\n    const int offset_c = c - deformable_group_index * 2 * kernel_h * kernel_w;\n\n    for (int col_c = (offset_c / 2); col_c < channel_per_deformable_group; col_c += col_step)\n    {\n      const int col_pos = (((col_c * batch_size + b) * height_col) + h) * width_col + w;\n      const int bp_dir = offset_c % 2;\n\n      int j = (col_pos / width_col / height_col / batch_size) % kernel_w;\n      int i = (col_pos / width_col / height_col / batch_size / kernel_w) % kernel_h;\n      int w_out = col_pos % width_col;\n      int h_out = (col_pos / width_col) % height_col;\n      int w_in = w_out * stride_w - pad_w;\n      int h_in = h_out * stride_h - pad_h;\n      const int data_offset_h_ptr = (((2 * (i * kernel_w + j)) * height_col + h_out) * width_col + w_out);\n      const int data_offset_w_ptr = (((2 * (i * kernel_w + j) + 1) * height_col + h_out) * width_col + w_out);\n      const int data_mask_hw_ptr = (((i * kernel_w + j) * height_col + h_out) * width_col + w_out);\n      const float offset_h = data_offset_ptr[data_offset_h_ptr];\n      const float offset_w = data_offset_ptr[data_offset_w_ptr];\n      const float mask = data_mask_ptr[data_mask_hw_ptr];\n      float inv_h = h_in + i * dilation_h + offset_h;\n      float inv_w = w_in + j * dilation_w + offset_w;\n      if (inv_h <= -1 || inv_w <= -1 || inv_h >= height || inv_w >= width)\n      {\n 
       inv_h = inv_w = -2;\n      }\n      else\n      {\n        mval += data_col_ptr[col_pos] * dmcn_im2col_bilinear_cuda(data_im_ptr + cnt * height * width, width, height, width, inv_h, inv_w);\n      }\n      const float weight = dmcn_get_coordinate_weight_cuda(\n          inv_h, inv_w,\n          height, width, data_im_ptr + cnt * height * width, width, bp_dir);\n      val += weight * data_col_ptr[col_pos] * mask;\n      cnt += 1;\n    }\n    // KERNEL_ASSIGN(grad_offset[index], offset_req, val);\n    grad_offset[index] = val;\n    if (offset_c % 2 == 0)\n      // KERNEL_ASSIGN(grad_mask[(((b * deformable_group + deformable_group_index) * kernel_h * kernel_w + offset_c / 2) * height_col + h) * width_col + w], mask_req, mval);\n      grad_mask[(((b * deformable_group + deformable_group_index) * kernel_h * kernel_w + offset_c / 2) * height_col + h) * width_col + w] = mval;\n  }\n}\n\nvoid modulated_deformable_im2col_cuda(cudaStream_t stream,\n  const float* data_im, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, const int dilation_w,\n  const int deformable_group, float* data_col) {\n  // num_axes should be smaller than block size\n  const int channel_per_deformable_group = channels / deformable_group;\n  const int num_kernels = channels * batch_size * height_col * width_col;\n  modulated_deformable_im2col_gpu_kernel\n      <<<GET_BLOCKS(num_kernels), CUDA_NUM_THREADS,\n          0, stream>>>(\n      num_kernels, data_im, data_offset, data_mask, height_im, width_im, kernel_h, kernel_w,\n      pad_h, pad_w, stride_h, stride_w, dilation_h, dilation_w, channel_per_deformable_group,\n      batch_size, channels, deformable_group, height_col, width_col, data_col);\n  \n  cudaError_t err = 
cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_im2col_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}\n\nvoid modulated_deformable_col2im_cuda(cudaStream_t stream,\n  const float* data_col, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, const int dilation_w, \n  const int deformable_group, float* grad_im){\n\n  const int channel_per_deformable_group = channels / deformable_group;\n  const int num_kernels = channels * kernel_h * kernel_w * batch_size * height_col * width_col;\n  modulated_deformable_col2im_gpu_kernel\n      <<<GET_BLOCKS(num_kernels), CUDA_NUM_THREADS,\n          0, stream>>>(\n        num_kernels, data_col, data_offset, data_mask, channels, height_im, width_im,\n        kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w,\n        dilation_h, dilation_w, channel_per_deformable_group,\n        batch_size, deformable_group, height_col, width_col, grad_im);\n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_col2im_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n\n}\n\nvoid modulated_deformable_col2im_coord_cuda(cudaStream_t stream,\n  const float* data_col, const float* data_im, const float* data_offset, const float* data_mask,\n  const int batch_size, const int channels, const int height_im, const int width_im, \n  const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n  const int pad_h, const int pad_w, const int stride_h, const int stride_w, \n  const int dilation_h, const int dilation_w, \n  const int deformable_group,\n  float* grad_offset, float* grad_mask) {\n  const int num_kernels = batch_size * height_col * width_col * 2 * 
kernel_h * kernel_w * deformable_group;\n  const int channel_per_deformable_group = channels * kernel_h * kernel_w / deformable_group;\n  modulated_deformable_col2im_coord_gpu_kernel\n      <<<GET_BLOCKS(num_kernels), CUDA_NUM_THREADS,\n        0, stream>>>(\n        num_kernels, data_col, data_im, data_offset, data_mask, channels, height_im, width_im,\n        kernel_h, kernel_w, pad_h, pad_w, stride_h, stride_w,\n        dilation_h, dilation_w, channel_per_deformable_group,\n        batch_size, 2 * kernel_h * kernel_w * deformable_group, deformable_group, height_col, width_col, \n        grad_offset, grad_mask);\n  cudaError_t err = cudaGetLastError();\n  if (err != cudaSuccess)\n  {\n    printf(\"error in modulated_deformable_col2im_coord_cuda: %s\\n\", cudaGetErrorString(err));\n  }\n}"
  },
  {
    "path": "DCNv2/src/cuda/dcn_v2_im2col_cuda.h",
    "content": "\n/*!\n ******************* BEGIN Caffe Copyright Notice and Disclaimer ****************\n *\n * COPYRIGHT\n *\n * All contributions by the University of California:\n * Copyright (c) 2014-2017 The Regents of the University of California (Regents)\n * All rights reserved.\n *\n * All other contributions:\n * Copyright (c) 2014-2017, the respective contributors\n * All rights reserved.\n *\n * Caffe uses a shared copyright model: each contributor holds copyright over\n * their contributions to Caffe. The project versioning records all such\n * contribution and copyright details. If a contributor wants to further mark\n * their specific copyright on a particular contribution, they should indicate\n * their copyright solely in the commit message of the change when it is\n * committed.\n *\n * LICENSE\n *\n * Redistribution and use in source and binary forms, with or without\n * modification, are permitted provided that the following conditions are met:\n *\n * 1. Redistributions of source code must retain the above copyright notice, this\n * list of conditions and the following disclaimer.\n * 2. Redistributions in binary form must reproduce the above copyright notice,\n * this list of conditions and the following disclaimer in the documentation\n * and/or other materials provided with the distribution.\n *\n * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS \"AS IS\" AND\n * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED\n * WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE\n * DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR\n * ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES\n * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;\n * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND\n * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT\n * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS\n * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.\n *\n * CONTRIBUTION AGREEMENT\n *\n * By contributing to the BVLC/caffe repository through pull-request, comment,\n * or otherwise, the contributor releases their content to the\n * license and copyright terms herein.\n *\n ***************** END Caffe Copyright Notice and Disclaimer ********************\n *\n * Copyright (c) 2018 Microsoft\n * Licensed under The MIT License [see LICENSE for details]\n * \\file modulated_deformable_im2col.h\n * \\brief Function definitions of converting an image to\n * column matrix based on kernel, padding, dilation, and offset.\n * These functions are mainly used in deformable convolution operators.\n * \\ref: https://arxiv.org/abs/1811.11168\n * \\author Yuwen Xiong, Haozhi Qi, Jifeng Dai, Xizhou Zhu, Han Hu\n */\n\n/***************** Adapted by Charles Shang *********************/\n\n#ifndef DCN_V2_IM2COL_CUDA\n#define DCN_V2_IM2COL_CUDA\n\n#ifdef __cplusplus\nextern \"C\"\n{\n#endif\n\n  void modulated_deformable_im2col_cuda(cudaStream_t stream,\n                                        const float *data_im, const float *data_offset, const float *data_mask,\n                                        const int batch_size, const int channels, const int height_im, const int width_im,\n                                        const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                        const int pad_h, const int pad_w, const int stride_h, const 
int stride_w,\n                                        const int dilation_h, const int dilation_w,\n                                        const int deformable_group, float *data_col);\n\n  void modulated_deformable_col2im_cuda(cudaStream_t stream,\n                                        const float *data_col, const float *data_offset, const float *data_mask,\n                                        const int batch_size, const int channels, const int height_im, const int width_im,\n                                        const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                        const int pad_h, const int pad_w, const int stride_h, const int stride_w,\n                                        const int dilation_h, const int dilation_w,\n                                        const int deformable_group, float *grad_im);\n\n  void modulated_deformable_col2im_coord_cuda(cudaStream_t stream,\n                                         const float *data_col, const float *data_im, const float *data_offset, const float *data_mask,\n                                         const int batch_size, const int channels, const int height_im, const int width_im,\n                                         const int height_col, const int width_col, const int kernel_h, const int kernel_w,\n                                         const int pad_h, const int pad_w, const int stride_h, const int stride_w,\n                                         const int dilation_h, const int dilation_w,\n                                         const int deformable_group,\n                                         float *grad_offset, float *grad_mask);\n\n#ifdef __cplusplus\n}\n#endif\n\n#endif"
  },
  {
    "path": "DCNv2/src/cuda/dcn_v2_psroi_pooling_cuda.cu",
    "content": "/*!\n * Copyright (c) 2017 Microsoft\n * Licensed under The MIT License [see LICENSE for details]\n * \\file deformable_psroi_pooling.cu\n * \\brief\n * \\author Yi Li, Guodong Zhang, Jifeng Dai\n*/\n/***************** Adapted by Charles Shang *********************/\n\n#include <cstdio>\n#include <algorithm>\n#include <cstring>\n#include <iostream>\n\n#include <ATen/ATen.h>\n#include <ATen/cuda/CUDAContext.h>\n\n#include <THC/THC.h>\n#include <THC/THCAtomics.cuh>\n#include <THC/THCDeviceUtils.cuh>\n\n#define CUDA_KERNEL_LOOP(i, n)                        \\\n  for (int i = blockIdx.x * blockDim.x + threadIdx.x; \\\n       i < (n);                                       \\\n       i += blockDim.x * gridDim.x)\n\nconst int CUDA_NUM_THREADS = 1024;\ninline int GET_BLOCKS(const int N)\n{\n  return (N + CUDA_NUM_THREADS - 1) / CUDA_NUM_THREADS;\n}\n\ntemplate <typename T>\n__device__ T bilinear_interp_cuda(\n    const T *data,\n    const T x,\n    const T y,\n    const int width,\n    const int height)\n{\n  int x1 = floor(x);\n  int x2 = ceil(x);\n  int y1 = floor(y);\n  int y2 = ceil(y);\n  T dist_x = static_cast<T>(x - x1);\n  T dist_y = static_cast<T>(y - y1);\n  T value11 = data[y1 * width + x1];\n  T value12 = data[y2 * width + x1];\n  T value21 = data[y1 * width + x2];\n  T value22 = data[y2 * width + x2];\n  T value = (1 - dist_x) * (1 - dist_y) * value11 +\n            (1 - dist_x) * dist_y * value12 +\n            dist_x * (1 - dist_y) * value21 +\n            dist_x * dist_y * value22;\n  return value;\n}\n\ntemplate <typename T>\n__global__ void DeformablePSROIPoolForwardKernelCuda(\n    const int count,\n    const T *bottom_data,\n    const T spatial_scale,\n    const int channels,\n    const int height, const int width,\n    const int pooled_height, const int pooled_width,\n    const T *bottom_rois, const T *bottom_trans,\n    const int no_trans,\n    const T trans_std,\n    const int sample_per_part,\n    const int output_dim,\n    const int 
group_size,\n    const int part_size,\n    const int num_classes,\n    const int channels_each_class,\n    T *top_data,\n    T *top_count)\n{\n  CUDA_KERNEL_LOOP(index, count)\n  {\n    // The output is in order (n, ctop, ph, pw)\n    int pw = index % pooled_width;\n    int ph = (index / pooled_width) % pooled_height;\n    int ctop = (index / pooled_width / pooled_height) % output_dim;\n    int n = index / pooled_width / pooled_height / output_dim;\n\n    // [start, end) interval for spatial sampling\n    const T *offset_bottom_rois = bottom_rois + n * 5;\n    int roi_batch_ind = offset_bottom_rois[0];\n    T roi_start_w = static_cast<T>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;\n    T roi_start_h = static_cast<T>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;\n    T roi_end_w = static_cast<T>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;\n    T roi_end_h = static_cast<T>(round(offset_bottom_rois[4]) + 1.) * spatial_scale - 0.5;\n\n    // Force too small ROIs to be 1x1\n    T roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0\n    T roi_height = max(roi_end_h - roi_start_h, 0.1);\n\n    // Compute w and h at bottom\n    T bin_size_h = roi_height / static_cast<T>(pooled_height);\n    T bin_size_w = roi_width / static_cast<T>(pooled_width);\n\n    T sub_bin_size_h = bin_size_h / static_cast<T>(sample_per_part);\n    T sub_bin_size_w = bin_size_w / static_cast<T>(sample_per_part);\n\n    int part_h = floor(static_cast<T>(ph) / pooled_height * part_size);\n    int part_w = floor(static_cast<T>(pw) / pooled_width * part_size);\n    int class_id = ctop / channels_each_class;\n    T trans_x = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w] * trans_std;\n    T trans_y = no_trans ? 
static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w] * trans_std;\n\n    T wstart = static_cast<T>(pw) * bin_size_w + roi_start_w;\n    wstart += trans_x * roi_width;\n    T hstart = static_cast<T>(ph) * bin_size_h + roi_start_h;\n    hstart += trans_y * roi_height;\n\n    T sum = 0;\n    int count = 0;\n    int gw = floor(static_cast<T>(pw) * group_size / pooled_width);\n    int gh = floor(static_cast<T>(ph) * group_size / pooled_height);\n    gw = min(max(gw, 0), group_size - 1);\n    gh = min(max(gh, 0), group_size - 1);\n\n    const T *offset_bottom_data = bottom_data + (roi_batch_ind * channels) * height * width;\n    for (int ih = 0; ih < sample_per_part; ih++)\n    {\n      for (int iw = 0; iw < sample_per_part; iw++)\n      {\n        T w = wstart + iw * sub_bin_size_w;\n        T h = hstart + ih * sub_bin_size_h;\n        // bilinear interpolation\n        if (w < -0.5 || w > width - 0.5 || h < -0.5 || h > height - 0.5)\n        {\n          continue;\n        }\n        w = min(max(w, 0.), width - 1.);\n        h = min(max(h, 0.), height - 1.);\n        int c = (ctop * group_size + gh) * group_size + gw;\n        T val = bilinear_interp_cuda(offset_bottom_data + c * height * width, w, h, width, height);\n        sum += val;\n        count++;\n      }\n    }\n    top_data[index] = count == 0 ? 
static_cast<T>(0) : sum / count;\n    top_count[index] = count;\n  }\n}\n\ntemplate <typename T>\n__global__ void DeformablePSROIPoolBackwardAccKernelCuda(\n    const int count,\n    const T *top_diff,\n    const T *top_count,\n    const int num_rois,\n    const T spatial_scale,\n    const int channels,\n    const int height, const int width,\n    const int pooled_height, const int pooled_width,\n    const int output_dim,\n    T *bottom_data_diff, T *bottom_trans_diff,\n    const T *bottom_data,\n    const T *bottom_rois,\n    const T *bottom_trans,\n    const int no_trans,\n    const T trans_std,\n    const int sample_per_part,\n    const int group_size,\n    const int part_size,\n    const int num_classes,\n    const int channels_each_class)\n{\n  CUDA_KERNEL_LOOP(index, count)\n  {\n    // The output is in order (n, ctop, ph, pw)\n    int pw = index % pooled_width;\n    int ph = (index / pooled_width) % pooled_height;\n    int ctop = (index / pooled_width / pooled_height) % output_dim;\n    int n = index / pooled_width / pooled_height / output_dim;\n\n    // [start, end) interval for spatial sampling\n    const T *offset_bottom_rois = bottom_rois + n * 5;\n    int roi_batch_ind = offset_bottom_rois[0];\n    T roi_start_w = static_cast<T>(round(offset_bottom_rois[1])) * spatial_scale - 0.5;\n    T roi_start_h = static_cast<T>(round(offset_bottom_rois[2])) * spatial_scale - 0.5;\n    T roi_end_w = static_cast<T>(round(offset_bottom_rois[3]) + 1.) * spatial_scale - 0.5;\n    T roi_end_h = static_cast<T>(round(offset_bottom_rois[4]) + 1.) 
* spatial_scale - 0.5;\n\n    // Force too small ROIs to be 1x1\n    T roi_width = max(roi_end_w - roi_start_w, 0.1); //avoid 0\n    T roi_height = max(roi_end_h - roi_start_h, 0.1);\n\n    // Compute w and h at bottom\n    T bin_size_h = roi_height / static_cast<T>(pooled_height);\n    T bin_size_w = roi_width / static_cast<T>(pooled_width);\n\n    T sub_bin_size_h = bin_size_h / static_cast<T>(sample_per_part);\n    T sub_bin_size_w = bin_size_w / static_cast<T>(sample_per_part);\n\n    int part_h = floor(static_cast<T>(ph) / pooled_height * part_size);\n    int part_w = floor(static_cast<T>(pw) / pooled_width * part_size);\n    int class_id = ctop / channels_each_class;\n    T trans_x = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w] * trans_std;\n    T trans_y = no_trans ? static_cast<T>(0) : bottom_trans[(((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w] * trans_std;\n\n    T wstart = static_cast<T>(pw) * bin_size_w + roi_start_w;\n    wstart += trans_x * roi_width;\n    T hstart = static_cast<T>(ph) * bin_size_h + roi_start_h;\n    hstart += trans_y * roi_height;\n\n    if (top_count[index] <= 0)\n    {\n      continue;\n    }\n    T diff_val = top_diff[index] / top_count[index];\n    const T *offset_bottom_data = bottom_data + roi_batch_ind * channels * height * width;\n    T *offset_bottom_data_diff = bottom_data_diff + roi_batch_ind * channels * height * width;\n    int gw = floor(static_cast<T>(pw) * group_size / pooled_width);\n    int gh = floor(static_cast<T>(ph) * group_size / pooled_height);\n    gw = min(max(gw, 0), group_size - 1);\n    gh = min(max(gh, 0), group_size - 1);\n\n    for (int ih = 0; ih < sample_per_part; ih++)\n    {\n      for (int iw = 0; iw < sample_per_part; iw++)\n      {\n        T w = wstart + iw * sub_bin_size_w;\n        T h = hstart + ih * sub_bin_size_h;\n        // bilinear interpolation\n        if (w < -0.5 || w 
> width - 0.5 || h < -0.5 || h > height - 0.5)\n        {\n          continue;\n        }\n        w = min(max(w, 0.), width - 1.);\n        h = min(max(h, 0.), height - 1.);\n        int c = (ctop * group_size + gh) * group_size + gw;\n        // backward on feature\n        int x0 = floor(w);\n        int x1 = ceil(w);\n        int y0 = floor(h);\n        int y1 = ceil(h);\n        T dist_x = w - x0, dist_y = h - y0;\n        T q00 = (1 - dist_x) * (1 - dist_y);\n        T q01 = (1 - dist_x) * dist_y;\n        T q10 = dist_x * (1 - dist_y);\n        T q11 = dist_x * dist_y;\n        int bottom_index_base = c * height * width;\n        atomicAdd(offset_bottom_data_diff + bottom_index_base + y0 * width + x0, q00 * diff_val);\n        atomicAdd(offset_bottom_data_diff + bottom_index_base + y1 * width + x0, q01 * diff_val);\n        atomicAdd(offset_bottom_data_diff + bottom_index_base + y0 * width + x1, q10 * diff_val);\n        atomicAdd(offset_bottom_data_diff + bottom_index_base + y1 * width + x1, q11 * diff_val);\n\n        if (no_trans)\n        {\n          continue;\n        }\n        T U00 = offset_bottom_data[bottom_index_base + y0 * width + x0];\n        T U01 = offset_bottom_data[bottom_index_base + y1 * width + x0];\n        T U10 = offset_bottom_data[bottom_index_base + y0 * width + x1];\n        T U11 = offset_bottom_data[bottom_index_base + y1 * width + x1];\n        T diff_x = (U11 * dist_y + U10 * (1 - dist_y) - U01 * dist_y - U00 * (1 - dist_y)) * trans_std * diff_val;\n        diff_x *= roi_width;\n        T diff_y = (U11 * dist_x + U01 * (1 - dist_x) - U10 * dist_x - U00 * (1 - dist_x)) * trans_std * diff_val;\n        diff_y *= roi_height;\n\n        atomicAdd(bottom_trans_diff + (((n * num_classes + class_id) * 2) * part_size + part_h) * part_size + part_w, diff_x);\n        atomicAdd(bottom_trans_diff + (((n * num_classes + class_id) * 2 + 1) * part_size + part_h) * part_size + part_w, diff_y);\n      }\n    }\n  
}\n}\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_cuda_forward(const at::Tensor &input,\n                                  const at::Tensor &bbox,\n                                  const at::Tensor &trans,\n                                  const int no_trans,\n                                  const float spatial_scale,\n                                  const int output_dim,\n                                  const int group_size,\n                                  const int pooled_size,\n                                  const int part_size,\n                                  const int sample_per_part,\n                                  const float trans_std)\n{\n  AT_ASSERTM(input.type().is_cuda(), \"input must be a CUDA tensor\");\n  AT_ASSERTM(bbox.type().is_cuda(), \"rois must be a CUDA tensor\");\n  AT_ASSERTM(trans.type().is_cuda(), \"trans must be a CUDA tensor\");\n\n  const int batch = input.size(0);\n  const int channels = input.size(1);\n  const int height = input.size(2);\n  const int width = input.size(3);\n  const int channels_trans = no_trans ? 2 : trans.size(1);\n  const int num_bbox = bbox.size(0);\n\n  AT_ASSERTM(channels == output_dim, \"input channels and output channels must equal\");\n  auto pooled_height = pooled_size;\n  auto pooled_width = pooled_size;\n\n  auto out = at::empty({num_bbox, output_dim, pooled_height, pooled_width}, input.options());\n  long out_size = num_bbox * output_dim * pooled_height * pooled_width;\n  auto top_count = at::zeros({num_bbox, output_dim, pooled_height, pooled_width}, input.options());\n\n  const int num_classes = no_trans ? 1 : channels_trans / 2;\n  const int channels_each_class = no_trans ? 
output_dim : output_dim / num_classes;\n\n  cudaStream_t stream = at::cuda::getCurrentCUDAStream();\n\n  if (out.numel() == 0)\n  {\n    THCudaCheck(cudaGetLastError());\n    return std::make_tuple(out, top_count);\n  }\n\n  dim3 grid(std::min(THCCeilDiv(out_size, 512L), 4096L));\n  dim3 block(512);\n\n  AT_DISPATCH_FLOATING_TYPES(input.type(), \"dcn_v2_psroi_pooling_cuda_forward\", [&] {\n    DeformablePSROIPoolForwardKernelCuda<scalar_t><<<grid, block, 0, stream>>>(\n        out_size,\n        input.contiguous().data<scalar_t>(),\n        spatial_scale,\n        channels,\n        height, width,\n        pooled_height,\n        pooled_width,\n        bbox.contiguous().data<scalar_t>(),\n        trans.contiguous().data<scalar_t>(),\n        no_trans,\n        trans_std,\n        sample_per_part,\n        output_dim,\n        group_size,\n        part_size,\n        num_classes,\n        channels_each_class,\n        out.data<scalar_t>(),\n        top_count.data<scalar_t>());\n  });\n  THCudaCheck(cudaGetLastError());\n  return std::make_tuple(out, top_count);\n}\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_cuda_backward(const at::Tensor &out_grad,\n                                   const at::Tensor &input,\n                                   const at::Tensor &bbox,\n                                   const at::Tensor &trans,\n                                   const at::Tensor &top_count,\n                                   const int no_trans,\n                                   const float spatial_scale,\n                                   const int output_dim,\n                                   const int group_size,\n                                   const int pooled_size,\n                                   const int part_size,\n                                   const int sample_per_part,\n                                   const float trans_std)\n{\n  AT_ASSERTM(out_grad.type().is_cuda(), \"out_grad must be a CUDA tensor\");\n  
AT_ASSERTM(input.type().is_cuda(), \"input must be a CUDA tensor\");\n  AT_ASSERTM(bbox.type().is_cuda(), \"bbox must be a CUDA tensor\");\n  AT_ASSERTM(trans.type().is_cuda(), \"trans must be a CUDA tensor\");\n  AT_ASSERTM(top_count.type().is_cuda(), \"top_count must be a CUDA tensor\");\n\n  const int batch = input.size(0);\n  const int channels = input.size(1);\n  const int height = input.size(2);\n  const int width = input.size(3);\n  const int channels_trans = no_trans ? 2 : trans.size(1);\n  const int num_bbox = bbox.size(0);\n\n  AT_ASSERTM(channels == output_dim, \"input channels and output channels must equal\");\n  auto pooled_height = pooled_size;\n  auto pooled_width = pooled_size;\n  long out_size = num_bbox * output_dim * pooled_height * pooled_width;\n  const int num_classes = no_trans ? 1 : channels_trans / 2;\n  const int channels_each_class = no_trans ? output_dim : output_dim / num_classes;\n\n  auto input_grad = at::zeros({batch, channels, height, width}, out_grad.options());\n  auto trans_grad = at::zeros_like(trans);\n\n  if (input_grad.numel() == 0)\n  {\n    THCudaCheck(cudaGetLastError());\n    return std::make_tuple(input_grad, trans_grad);\n  }\n\n  dim3 grid(std::min(THCCeilDiv(out_size, 512L), 4096L));\n  dim3 block(512);\n  cudaStream_t stream = at::cuda::getCurrentCUDAStream();\n\n  AT_DISPATCH_FLOATING_TYPES(out_grad.type(), \"dcn_v2_psroi_pooling_cuda_backward\", [&] {\n    DeformablePSROIPoolBackwardAccKernelCuda<scalar_t><<<grid, block, 0, stream>>>(\n        out_size,\n        out_grad.contiguous().data<scalar_t>(),\n        top_count.contiguous().data<scalar_t>(),\n        num_bbox,\n        spatial_scale,\n        channels,\n        height,\n        width,\n        pooled_height,\n        pooled_width,\n        output_dim,\n        input_grad.contiguous().data<scalar_t>(),\n        trans_grad.contiguous().data<scalar_t>(),\n        input.contiguous().data<scalar_t>(),\n        bbox.contiguous().data<scalar_t>(),\n        
trans.contiguous().data<scalar_t>(),\n        no_trans,\n        trans_std,\n        sample_per_part,\n        group_size,\n        part_size,\n        num_classes,\n        channels_each_class);\n  });\n  THCudaCheck(cudaGetLastError());\n  return std::make_tuple(input_grad, trans_grad);\n}"
  },
  {
    "path": "DCNv2/src/cuda/vision.h",
    "content": "#pragma once\n#include <torch/extension.h>\n\nat::Tensor\ndcn_v2_cuda_forward(const at::Tensor &input,\n                    const at::Tensor &weight,\n                    const at::Tensor &bias,\n                    const at::Tensor &offset,\n                    const at::Tensor &mask,\n                    const int kernel_h,\n                    const int kernel_w,\n                    const int stride_h,\n                    const int stride_w,\n                    const int pad_h,\n                    const int pad_w,\n                    const int dilation_h,\n                    const int dilation_w,\n                    const int deformable_group);\n\nstd::vector<at::Tensor>\ndcn_v2_cuda_backward(const at::Tensor &input,\n                     const at::Tensor &weight,\n                     const at::Tensor &bias,\n                     const at::Tensor &offset,\n                     const at::Tensor &mask,\n                     const at::Tensor &grad_output,\n                     int kernel_h, int kernel_w,\n                     int stride_h, int stride_w,\n                     int pad_h, int pad_w,\n                     int dilation_h, int dilation_w,\n                     int deformable_group);\n\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_cuda_forward(const at::Tensor &input,\n                                  const at::Tensor &bbox,\n                                  const at::Tensor &trans,\n                                  const int no_trans,\n                                  const float spatial_scale,\n                                  const int output_dim,\n                                  const int group_size,\n                                  const int pooled_size,\n                                  const int part_size,\n                                  const int sample_per_part,\n                                  const float trans_std);\n\nstd::tuple<at::Tensor, 
at::Tensor>\ndcn_v2_psroi_pooling_cuda_backward(const at::Tensor &out_grad,\n                                   const at::Tensor &input,\n                                   const at::Tensor &bbox,\n                                   const at::Tensor &trans,\n                                   const at::Tensor &top_count,\n                                   const int no_trans,\n                                   const float spatial_scale,\n                                   const int output_dim,\n                                   const int group_size,\n                                   const int pooled_size,\n                                   const int part_size,\n                                   const int sample_per_part,\n                                   const float trans_std);"
  },
  {
    "path": "DCNv2/src/dcn_v2.h",
    "content": "#pragma once\n\n#include \"cpu/vision.h\"\n\n#ifdef WITH_CUDA\n#include \"cuda/vision.h\"\n#endif\n\nat::Tensor\ndcn_v2_forward(const at::Tensor &input,\n               const at::Tensor &weight,\n               const at::Tensor &bias,\n               const at::Tensor &offset,\n               const at::Tensor &mask,\n               const int kernel_h,\n               const int kernel_w,\n               const int stride_h,\n               const int stride_w,\n               const int pad_h,\n               const int pad_w,\n               const int dilation_h,\n               const int dilation_w,\n               const int deformable_group)\n{\n    if (input.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return dcn_v2_cuda_forward(input, weight, bias, offset, mask,\n                                   kernel_h, kernel_w,\n                                   stride_h, stride_w,\n                                   pad_h, pad_w,\n                                   dilation_h, dilation_w,\n                                   deformable_group);\n#else\n        AT_ERROR(\"Not compiled with GPU support\");\n#endif\n    }\n    else{\n        return dcn_v2_cpu_forward(input, weight, bias, offset, mask,\n                                   kernel_h, kernel_w,\n                                   stride_h, stride_w,\n                                   pad_h, pad_w,\n                                   dilation_h, dilation_w,\n                                   deformable_group);\n    }\n}\n\nstd::vector<at::Tensor>\ndcn_v2_backward(const at::Tensor &input,\n                const at::Tensor &weight,\n                const at::Tensor &bias,\n                const at::Tensor &offset,\n                const at::Tensor &mask,\n                const at::Tensor &grad_output,\n                int kernel_h, int kernel_w,\n                int stride_h, int stride_w,\n                int pad_h, int pad_w,\n                int dilation_h, int dilation_w,\n                int 
deformable_group)\n{\n    if (input.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return dcn_v2_cuda_backward(input,\n                                    weight,\n                                    bias,\n                                    offset,\n                                    mask,\n                                    grad_output,\n                                    kernel_h, kernel_w,\n                                    stride_h, stride_w,\n                                    pad_h, pad_w,\n                                    dilation_h, dilation_w,\n                                    deformable_group);\n#else\n        AT_ERROR(\"Not compiled with GPU support\");\n#endif\n    }\n    else{\n        return dcn_v2_cpu_backward(input,\n                                    weight,\n                                    bias,\n                                    offset,\n                                    mask,\n                                    grad_output,\n                                    kernel_h, kernel_w,\n                                    stride_h, stride_w,\n                                    pad_h, pad_w,\n                                    dilation_h, dilation_w,\n                                    deformable_group);\n    }\n}\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_forward(const at::Tensor &input,\n                             const at::Tensor &bbox,\n                             const at::Tensor &trans,\n                             const int no_trans,\n                             const float spatial_scale,\n                             const int output_dim,\n                             const int group_size,\n                             const int pooled_size,\n                             const int part_size,\n                             const int sample_per_part,\n                             const float trans_std)\n{\n    if (input.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return 
dcn_v2_psroi_pooling_cuda_forward(input,\n                                                 bbox,\n                                                 trans,\n                                                 no_trans,\n                                                 spatial_scale,\n                                                 output_dim,\n                                                 group_size,\n                                                 pooled_size,\n                                                 part_size,\n                                                 sample_per_part,\n                                                 trans_std);\n#else\n        AT_ERROR(\"Not compiled with GPU support\");\n#endif\n    }\n    else{\n        return dcn_v2_psroi_pooling_cpu_forward(input,\n                                                 bbox,\n                                                 trans,\n                                                 no_trans,\n                                                 spatial_scale,\n                                                 output_dim,\n                                                 group_size,\n                                                 pooled_size,\n                                                 part_size,\n                                                 sample_per_part,\n                                                 trans_std);\n    }\n}\n\nstd::tuple<at::Tensor, at::Tensor>\ndcn_v2_psroi_pooling_backward(const at::Tensor &out_grad,\n                              const at::Tensor &input,\n                              const at::Tensor &bbox,\n                              const at::Tensor &trans,\n                              const at::Tensor &top_count,\n                              const int no_trans,\n                              const float spatial_scale,\n                              const int output_dim,\n                              const int group_size,\n                              const 
int pooled_size,\n                              const int part_size,\n                              const int sample_per_part,\n                              const float trans_std)\n{\n    if (input.type().is_cuda())\n    {\n#ifdef WITH_CUDA\n        return dcn_v2_psroi_pooling_cuda_backward(out_grad,\n                                                  input,\n                                                  bbox,\n                                                  trans,\n                                                  top_count,\n                                                  no_trans,\n                                                  spatial_scale,\n                                                  output_dim,\n                                                  group_size,\n                                                  pooled_size,\n                                                  part_size,\n                                                  sample_per_part,\n                                                  trans_std);\n#else\n        AT_ERROR(\"Not compiled with GPU support\");\n#endif\n    }\n    else{\n        return dcn_v2_psroi_pooling_cpu_backward(out_grad,\n                                                  input,\n                                                  bbox,\n                                                  trans,\n                                                  top_count,\n                                                  no_trans,\n                                                  spatial_scale,\n                                                  output_dim,\n                                                  group_size,\n                                                  pooled_size,\n                                                  part_size,\n                                                  sample_per_part,\n                                                  trans_std);\n    }\n}"
  },
  {
    "path": "DCNv2/src/vision.cpp",
    "content": "\n#include \"dcn_v2.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n  m.def(\"dcn_v2_forward\", &dcn_v2_forward, \"dcn_v2_forward\");\n  m.def(\"dcn_v2_backward\", &dcn_v2_backward, \"dcn_v2_backward\");\n  m.def(\"dcn_v2_psroi_pooling_forward\", &dcn_v2_psroi_pooling_forward, \"dcn_v2_psroi_pooling_forward\");\n  m.def(\"dcn_v2_psroi_pooling_backward\", &dcn_v2_psroi_pooling_backward, \"dcn_v2_psroi_pooling_backward\");\n}\n"
  },
  {
    "path": "DCNv2/test/test.py",
    "content": "#!/usr/bin/env python\nfrom __future__ import absolute_import, division, print_function\n\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import gradcheck\n\nfrom dcn_v2 import DCN, DCNPooling, DCNv2, DCNv2Pooling, dcn_v2_conv, dcn_v2_pooling\n\ndeformable_groups = 1\nN, inC, inH, inW = 2, 2, 4, 4\noutC = 2\nkH, kW = 3, 3\n\n\ndef conv_identify(weight, bias):\n    weight.data.zero_()\n    bias.data.zero_()\n    o, i, h, w = weight.shape\n    y = h // 2\n    x = w // 2\n    for p in range(i):\n        for q in range(o):\n            if p == q:\n                weight.data[q, p, y, x] = 1.0\n\n\ndef check_zero_offset():\n    conv_offset = nn.Conv2d(\n        inC,\n        deformable_groups * 2 * kH * kW,\n        kernel_size=(kH, kW),\n        stride=(1, 1),\n        padding=(1, 1),\n        bias=True,\n    ).cuda()\n\n    conv_mask = nn.Conv2d(\n        inC,\n        deformable_groups * 1 * kH * kW,\n        kernel_size=(kH, kW),\n        stride=(1, 1),\n        padding=(1, 1),\n        bias=True,\n    ).cuda()\n\n    dcn_v2 = DCNv2(inC, outC, (kH, kW), stride=1, padding=1, dilation=1, deformable_groups=deformable_groups).cuda()\n\n    conv_offset.weight.data.zero_()\n    conv_offset.bias.data.zero_()\n    conv_mask.weight.data.zero_()\n    conv_mask.bias.data.zero_()\n    conv_identify(dcn_v2.weight, dcn_v2.bias)\n\n    input = torch.randn(N, inC, inH, inW).cuda()\n    offset = conv_offset(input)\n    mask = conv_mask(input)\n    mask = torch.sigmoid(mask)\n    output = dcn_v2(input, offset, mask)\n    output *= 2\n    d = (input - output).abs().max()\n    if d < 1e-10:\n        print(\"Zero offset passed\")\n    else:\n        print(\"Zero offset failed\")\n        print(input)\n        print(output)\n\n\ndef check_gradient_dconv():\n\n    input = torch.rand(N, inC, inH, inW).cuda() * 0.01\n    input.requires_grad = True\n\n    offset = torch.randn(N, deformable_groups * 2 * kW * kH, inH, inW).cuda() * 2\n    # offset.data.zero_()\n    # 
offset.data -= 0.5\n    offset.requires_grad = True\n\n    mask = torch.rand(N, deformable_groups * 1 * kW * kH, inH, inW).cuda()\n    # mask.data.zero_()\n    mask.requires_grad = True\n    mask = torch.sigmoid(mask)\n\n    weight = torch.randn(outC, inC, kH, kW).cuda()\n    weight.requires_grad = True\n\n    bias = torch.rand(outC).cuda()\n    bias.requires_grad = True\n\n    stride = 1\n    padding = 1\n    dilation = 1\n\n    print(\n        \"check_gradient_dconv: \",\n        gradcheck(\n            dcn_v2_conv,\n            (input, offset, mask, weight, bias, stride, padding, dilation, deformable_groups),\n            eps=1e-3,\n            atol=1e-4,\n            rtol=1e-2,\n        ),\n    )\n\n\ndef check_pooling_zero_offset():\n\n    input = torch.randn(2, 16, 64, 64).cuda().zero_()\n    input[0, :, 16:26, 16:26] = 1.0\n    input[1, :, 10:20, 20:30] = 2.0\n    rois = (\n        torch.tensor(\n            [\n                [0, 65, 65, 103, 103],\n                [1, 81, 41, 119, 79],\n            ]\n        )\n        .cuda()\n        .float()\n    )\n    pooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=16,\n        no_trans=True,\n        group_size=1,\n        trans_std=0.0,\n    ).cuda()\n\n    out = pooling(input, rois, input.new())\n    s = \", \".join([\"%f\" % out[i, :, :, :].mean().item() for i in range(rois.shape[0])])\n    print(s)\n\n    dpooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=16,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.0,\n    ).cuda()\n    offset = torch.randn(20, 2, 7, 7).cuda().zero_()\n    dout = dpooling(input, rois, offset)\n    s = \", \".join([\"%f\" % dout[i, :, :, :].mean().item() for i in range(rois.shape[0])])\n    print(s)\n\n\ndef check_gradient_dpooling():\n    input = torch.randn(2, 3, 5, 5).cuda() * 0.01\n    N = 4\n    batch_inds = torch.randint(2, (N, 1)).cuda().float()\n    x = 
torch.rand((N, 1)).cuda().float() * 15\n    y = torch.rand((N, 1)).cuda().float() * 15\n    w = torch.rand((N, 1)).cuda().float() * 10\n    h = torch.rand((N, 1)).cuda().float() * 10\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n    offset = torch.randn(N, 2, 3, 3).cuda()\n    input.requires_grad = True\n    offset.requires_grad = True\n\n    spatial_scale = 1.0 / 4\n    pooled_size = 3\n    output_dim = 3\n    no_trans = 0\n    group_size = 1\n    trans_std = 0.0\n    sample_per_part = 4\n    part_size = pooled_size\n\n    print(\n        \"check_gradient_dpooling:\",\n        gradcheck(\n            dcn_v2_pooling,\n            (\n                input,\n                rois,\n                offset,\n                spatial_scale,\n                pooled_size,\n                output_dim,\n                no_trans,\n                group_size,\n                part_size,\n                sample_per_part,\n                trans_std,\n            ),\n            eps=1e-4,\n        ),\n    )\n\n\ndef example_dconv():\n    input = torch.randn(2, 64, 128, 128).cuda()\n    # wrap all things (offset and mask) in DCN\n    dcn = DCN(64, 64, kernel_size=(3, 3), stride=1, padding=1, deformable_groups=2).cuda()\n    # print(dcn.weight.shape, input.shape)\n    output = dcn(input)\n    target = output.new(*output.size())\n    target.data.uniform_(-0.01, 0.01)\n    error = (target - output).mean()\n    error.backward()\n    print(output.shape)\n\n\ndef example_dpooling():\n    input = torch.randn(2, 32, 64, 64).cuda()\n    batch_inds = torch.randint(2, (20, 1)).cuda().float()\n    x = torch.randint(256, (20, 1)).cuda().float()\n    y = torch.randint(256, (20, 1)).cuda().float()\n    w = torch.randint(64, (20, 1)).cuda().float()\n    h = torch.randint(64, (20, 1)).cuda().float()\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n    offset = torch.randn(20, 2, 7, 7).cuda()\n    input.requires_grad = True\n    offset.requires_grad = True\n\n    
# normal roi_align\n    pooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=True,\n        group_size=1,\n        trans_std=0.1,\n    ).cuda()\n\n    # deformable pooling\n    dpooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.1,\n    ).cuda()\n\n    out = pooling(input, rois, offset)\n    dout = dpooling(input, rois, offset)\n    print(out.shape)\n    print(dout.shape)\n\n    target_out = out.new(*out.size())\n    target_out.data.uniform_(-0.01, 0.01)\n    target_dout = dout.new(*dout.size())\n    target_dout.data.uniform_(-0.01, 0.01)\n    e = (target_out - out).mean()\n    e.backward()\n    e = (target_dout - dout).mean()\n    e.backward()\n\n\ndef example_mdpooling():\n    input = torch.randn(2, 32, 64, 64).cuda()\n    input.requires_grad = True\n    batch_inds = torch.randint(2, (20, 1)).cuda().float()\n    x = torch.randint(256, (20, 1)).cuda().float()\n    y = torch.randint(256, (20, 1)).cuda().float()\n    w = torch.randint(64, (20, 1)).cuda().float()\n    h = torch.randint(64, (20, 1)).cuda().float()\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n\n    # modulated deformable pooling (V2)\n    dpooling = DCNPooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.1,\n        deform_fc_dim=1024,\n    ).cuda()\n\n    dout = dpooling(input, rois)\n    target = dout.new(*dout.size())\n    target.data.uniform_(-0.1, 0.1)\n    error = (target - dout).mean()\n    error.backward()\n    print(dout.shape)\n\n\nif __name__ == \"__main__\":\n\n    example_dconv()\n    example_dpooling()\n    example_mdpooling()\n\n    check_pooling_zero_offset()\n    # zero offset check\n    if inC == outC:\n        check_zero_offset()\n\n    check_gradient_dpooling()\n    
check_gradient_dconv()\n    # \"\"\"\n    # ****** Note: the 'backward is not reentrant' error may not be a serious problem,\n    # ****** since the max error is less than 1e-7.\n    # ****** Still looking for what triggers this problem.\n    # \"\"\"\n
  },
  {
    "path": "DCNv2/test/testcpu.py",
    "content": "#!/usr/bin/env python\nfrom __future__ import absolute_import, division, print_function\n\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import gradcheck\n\nfrom dcn_v2 import DCN, DCNPooling, DCNv2, DCNv2Pooling, dcn_v2_conv, dcn_v2_pooling\n\ndeformable_groups = 1\nN, inC, inH, inW = 2, 2, 4, 4\noutC = 2\nkH, kW = 3, 3\n\n\ndef conv_identify(weight, bias):\n    weight.data.zero_()\n    bias.data.zero_()\n    o, i, h, w = weight.shape\n    y = h // 2\n    x = w // 2\n    for p in range(i):\n        for q in range(o):\n            if p == q:\n                weight.data[q, p, y, x] = 1.0\n\n\ndef check_zero_offset():\n    conv_offset = nn.Conv2d(\n        inC,\n        deformable_groups * 2 * kH * kW,\n        kernel_size=(kH, kW),\n        stride=(1, 1),\n        padding=(1, 1),\n        bias=True,\n    )\n\n    conv_mask = nn.Conv2d(\n        inC,\n        deformable_groups * 1 * kH * kW,\n        kernel_size=(kH, kW),\n        stride=(1, 1),\n        padding=(1, 1),\n        bias=True,\n    )\n\n    dcn_v2 = DCNv2(inC, outC, (kH, kW), stride=1, padding=1, dilation=1, deformable_groups=deformable_groups)\n\n    conv_offset.weight.data.zero_()\n    conv_offset.bias.data.zero_()\n    conv_mask.weight.data.zero_()\n    conv_mask.bias.data.zero_()\n    conv_identify(dcn_v2.weight, dcn_v2.bias)\n\n    input = torch.randn(N, inC, inH, inW)\n    offset = conv_offset(input)\n    mask = conv_mask(input)\n    mask = torch.sigmoid(mask)\n    output = dcn_v2(input, offset, mask)\n    output *= 2\n    d = (input - output).abs().max()\n    if d < 1e-10:\n        print(\"Zero offset passed\")\n    else:\n        print(\"Zero offset failed\")\n        print(input)\n        print(output)\n\n\ndef check_gradient_dconv():\n\n    input = torch.rand(N, inC, inH, inW) * 0.01\n    input.requires_grad = True\n\n    offset = torch.randn(N, deformable_groups * 2 * kW * kH, inH, inW) * 2\n    # offset.data.zero_()\n    # offset.data -= 0.5\n    
offset.requires_grad = True\n\n    mask = torch.rand(N, deformable_groups * 1 * kW * kH, inH, inW)\n    # mask.data.zero_()\n    mask.requires_grad = True\n    mask = torch.sigmoid(mask)\n\n    weight = torch.randn(outC, inC, kH, kW)\n    weight.requires_grad = True\n\n    bias = torch.rand(outC)\n    bias.requires_grad = True\n\n    stride = 1\n    padding = 1\n    dilation = 1\n\n    print(\n        \"check_gradient_dconv: \",\n        gradcheck(\n            dcn_v2_conv,\n            (input, offset, mask, weight, bias, stride, padding, dilation, deformable_groups),\n            eps=1e-3,\n            atol=1e-4,\n            rtol=1e-2,\n        ),\n    )\n\n\ndef check_pooling_zero_offset():\n\n    input = torch.randn(2, 16, 64, 64).zero_()\n    input[0, :, 16:26, 16:26] = 1.0\n    input[1, :, 10:20, 20:30] = 2.0\n    rois = torch.tensor(\n        [\n            [0, 65, 65, 103, 103],\n            [1, 81, 41, 119, 79],\n        ]\n    ).float()\n    pooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=16,\n        no_trans=True,\n        group_size=1,\n        trans_std=0.0,\n    )\n\n    out = pooling(input, rois, input.new())\n    s = \", \".join([\"%f\" % out[i, :, :, :].mean().item() for i in range(rois.shape[0])])\n    print(s)\n\n    dpooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=16,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.0,\n    )\n    offset = torch.randn(20, 2, 7, 7).zero_()\n    dout = dpooling(input, rois, offset)\n    s = \", \".join([\"%f\" % dout[i, :, :, :].mean().item() for i in range(rois.shape[0])])\n    print(s)\n\n\ndef check_gradient_dpooling():\n    input = torch.randn(2, 3, 5, 5) * 0.01\n    N = 4\n    batch_inds = torch.randint(2, (N, 1)).float()\n    x = torch.rand((N, 1)).float() * 15\n    y = torch.rand((N, 1)).float() * 15\n    w = torch.rand((N, 1)).float() * 10\n    h = torch.rand((N, 1)).float() * 10\n   
 rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n    offset = torch.randn(N, 2, 3, 3)\n    input.requires_grad = True\n    offset.requires_grad = True\n\n    spatial_scale = 1.0 / 4\n    pooled_size = 3\n    output_dim = 3\n    no_trans = 0\n    group_size = 1\n    trans_std = 0.0\n    sample_per_part = 4\n    part_size = pooled_size\n\n    print(\n        \"check_gradient_dpooling:\",\n        gradcheck(\n            dcn_v2_pooling,\n            (\n                input,\n                rois,\n                offset,\n                spatial_scale,\n                pooled_size,\n                output_dim,\n                no_trans,\n                group_size,\n                part_size,\n                sample_per_part,\n                trans_std,\n            ),\n            eps=1e-4,\n        ),\n    )\n\n\ndef example_dconv():\n    input = torch.randn(2, 64, 128, 128)\n    # wrap all things (offset and mask) in DCN\n    dcn = DCN(64, 64, kernel_size=(3, 3), stride=1, padding=1, deformable_groups=2)\n    # print(dcn.weight.shape, input.shape)\n    output = dcn(input)\n    target = output.new(*output.size())\n    target.data.uniform_(-0.01, 0.01)\n    error = (target - output).mean()\n    error.backward()\n    print(output.shape)\n\n\ndef example_dpooling():\n    input = torch.randn(2, 32, 64, 64)\n    batch_inds = torch.randint(2, (20, 1)).float()\n    x = torch.randint(256, (20, 1)).float()\n    y = torch.randint(256, (20, 1)).float()\n    w = torch.randint(64, (20, 1)).float()\n    h = torch.randint(64, (20, 1)).float()\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n    offset = torch.randn(20, 2, 7, 7)\n    input.requires_grad = True\n    offset.requires_grad = True\n\n    # normal roi_align\n    pooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=True,\n        group_size=1,\n        trans_std=0.1,\n    )\n\n    # deformable pooling\n    dpooling = 
DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.1,\n    )\n\n    out = pooling(input, rois, offset)\n    dout = dpooling(input, rois, offset)\n    print(out.shape)\n    print(dout.shape)\n\n    target_out = out.new(*out.size())\n    target_out.data.uniform_(-0.01, 0.01)\n    target_dout = dout.new(*dout.size())\n    target_dout.data.uniform_(-0.01, 0.01)\n    e = (target_out - out).mean()\n    e.backward()\n    e = (target_dout - dout).mean()\n    e.backward()\n\n\ndef example_mdpooling():\n    input = torch.randn(2, 32, 64, 64)\n    input.requires_grad = True\n    batch_inds = torch.randint(2, (20, 1)).float()\n    x = torch.randint(256, (20, 1)).float()\n    y = torch.randint(256, (20, 1)).float()\n    w = torch.randint(64, (20, 1)).float()\n    h = torch.randint(64, (20, 1)).float()\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n\n    # modulated deformable pooling (V2)\n    dpooling = DCNPooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.1,\n        deform_fc_dim=1024,\n    )\n\n    dout = dpooling(input, rois)\n    target = dout.new(*dout.size())\n    target.data.uniform_(-0.1, 0.1)\n    error = (target - dout).mean()\n    error.backward()\n    print(dout.shape)\n\n\nif __name__ == \"__main__\":\n\n    example_dconv()\n    example_dpooling()\n    example_mdpooling()\n\n    check_pooling_zero_offset()\n    # zero offset check\n    if inC == outC:\n        check_zero_offset()\n\n    check_gradient_dpooling()\n    check_gradient_dconv()\n    # \"\"\"\n    # ****** Note: the 'backward is not reentrant' error may not be a serious problem,\n    # ****** since the max error is less than 1e-7.\n    # ****** Still looking for what triggers this problem.\n    # \"\"\"\n"
  },
  {
    "path": "DCNv2/test/testcuda.py",
    "content": "#!/usr/bin/env python\nfrom __future__ import absolute_import, division, print_function\n\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import gradcheck\n\nfrom dcn_v2 import DCN, DCNPooling, DCNv2, DCNv2Pooling, dcn_v2_conv, dcn_v2_pooling\n\ndeformable_groups = 1\nN, inC, inH, inW = 2, 2, 4, 4\noutC = 2\nkH, kW = 3, 3\n\n\ndef conv_identify(weight, bias):\n    weight.data.zero_()\n    bias.data.zero_()\n    o, i, h, w = weight.shape\n    y = h // 2\n    x = w // 2\n    for p in range(i):\n        for q in range(o):\n            if p == q:\n                weight.data[q, p, y, x] = 1.0\n\n\ndef check_zero_offset():\n    conv_offset = nn.Conv2d(\n        inC,\n        deformable_groups * 2 * kH * kW,\n        kernel_size=(kH, kW),\n        stride=(1, 1),\n        padding=(1, 1),\n        bias=True,\n    ).cuda()\n\n    conv_mask = nn.Conv2d(\n        inC,\n        deformable_groups * 1 * kH * kW,\n        kernel_size=(kH, kW),\n        stride=(1, 1),\n        padding=(1, 1),\n        bias=True,\n    ).cuda()\n\n    dcn_v2 = DCNv2(inC, outC, (kH, kW), stride=1, padding=1, dilation=1, deformable_groups=deformable_groups).cuda()\n\n    conv_offset.weight.data.zero_()\n    conv_offset.bias.data.zero_()\n    conv_mask.weight.data.zero_()\n    conv_mask.bias.data.zero_()\n    conv_identify(dcn_v2.weight, dcn_v2.bias)\n\n    input = torch.randn(N, inC, inH, inW).cuda()\n    offset = conv_offset(input)\n    mask = conv_mask(input)\n    mask = torch.sigmoid(mask)\n    output = dcn_v2(input, offset, mask)\n    output *= 2\n    d = (input - output).abs().max()\n    if d < 1e-10:\n        print(\"Zero offset passed\")\n    else:\n        print(\"Zero offset failed\")\n        print(input)\n        print(output)\n\n\ndef check_gradient_dconv():\n\n    input = torch.rand(N, inC, inH, inW).cuda() * 0.01\n    input.requires_grad = True\n\n    offset = torch.randn(N, deformable_groups * 2 * kW * kH, inH, inW).cuda() * 2\n    # offset.data.zero_()\n    # 
offset.data -= 0.5\n    offset.requires_grad = True\n\n    mask = torch.rand(N, deformable_groups * 1 * kW * kH, inH, inW).cuda()\n    # mask.data.zero_()\n    mask.requires_grad = True\n    mask = torch.sigmoid(mask)\n\n    weight = torch.randn(outC, inC, kH, kW).cuda()\n    weight.requires_grad = True\n\n    bias = torch.rand(outC).cuda()\n    bias.requires_grad = True\n\n    stride = 1\n    padding = 1\n    dilation = 1\n\n    print(\n        \"check_gradient_dconv: \",\n        gradcheck(\n            dcn_v2_conv,\n            (input, offset, mask, weight, bias, stride, padding, dilation, deformable_groups),\n            eps=1e-3,\n            atol=1e-4,\n            rtol=1e-2,\n        ),\n    )\n\n\ndef check_pooling_zero_offset():\n\n    input = torch.randn(2, 16, 64, 64).cuda().zero_()\n    input[0, :, 16:26, 16:26] = 1.0\n    input[1, :, 10:20, 20:30] = 2.0\n    rois = (\n        torch.tensor(\n            [\n                [0, 65, 65, 103, 103],\n                [1, 81, 41, 119, 79],\n            ]\n        )\n        .cuda()\n        .float()\n    )\n    pooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=16,\n        no_trans=True,\n        group_size=1,\n        trans_std=0.0,\n    ).cuda()\n\n    out = pooling(input, rois, input.new())\n    s = \", \".join([\"%f\" % out[i, :, :, :].mean().item() for i in range(rois.shape[0])])\n    print(s)\n\n    dpooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=16,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.0,\n    ).cuda()\n    offset = torch.randn(20, 2, 7, 7).cuda().zero_()\n    dout = dpooling(input, rois, offset)\n    s = \", \".join([\"%f\" % dout[i, :, :, :].mean().item() for i in range(rois.shape[0])])\n    print(s)\n\n\ndef check_gradient_dpooling():\n    input = torch.randn(2, 3, 5, 5).cuda().float() * 0.01\n    N = 4\n    batch_inds = torch.randint(2, (N, 1)).cuda().float()\n    x 
 = torch.rand((N, 1)).cuda().float() * 15\n    y = torch.rand((N, 1)).cuda().float() * 15\n    w = torch.rand((N, 1)).cuda().float() * 10\n    h = torch.rand((N, 1)).cuda().float() * 10\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n    offset = torch.randn(N, 2, 3, 3).cuda()\n    input.requires_grad = True\n    offset.requires_grad = True\n\n    spatial_scale = 1.0 / 4\n    pooled_size = 3\n    output_dim = 3\n    no_trans = 0\n    group_size = 1\n    trans_std = 0.0\n    sample_per_part = 4\n    part_size = pooled_size\n\n    print(\n        \"check_gradient_dpooling:\",\n        gradcheck(\n            dcn_v2_pooling,\n            (\n                input,\n                rois,\n                offset,\n                spatial_scale,\n                pooled_size,\n                output_dim,\n                no_trans,\n                group_size,\n                part_size,\n                sample_per_part,\n                trans_std,\n            ),\n            eps=1e-4,\n        ),\n    )\n\n\ndef example_dconv():\n    input = torch.randn(2, 64, 128, 128).cuda()\n    # wrap all things (offset and mask) in DCN\n    dcn = DCN(64, 64, kernel_size=(3, 3), stride=1, padding=1, deformable_groups=2).cuda()\n    # print(dcn.weight.shape, input.shape)\n    output = dcn(input)\n    target = output.new(*output.size())\n    target.data.uniform_(-0.01, 0.01)\n    error = (target - output).mean()\n    error.backward()\n    print(output.shape)\n\n\ndef example_dpooling():\n    input = torch.randn(2, 32, 64, 64).cuda()\n    batch_inds = torch.randint(2, (20, 1)).cuda().float()\n    x = torch.randint(256, (20, 1)).cuda().float()\n    y = torch.randint(256, (20, 1)).cuda().float()\n    w = torch.randint(64, (20, 1)).cuda().float()\n    h = torch.randint(64, (20, 1)).cuda().float()\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n    offset = torch.randn(20, 2, 7, 7).cuda()\n    input.requires_grad = True\n    offset.requires_grad = True\n\n   
 # normal roi_align\n    pooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=True,\n        group_size=1,\n        trans_std=0.1,\n    ).cuda()\n\n    # deformable pooling\n    dpooling = DCNv2Pooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.1,\n    ).cuda()\n\n    out = pooling(input, rois, offset)\n    dout = dpooling(input, rois, offset)\n    print(out.shape)\n    print(dout.shape)\n\n    target_out = out.new(*out.size())\n    target_out.data.uniform_(-0.01, 0.01)\n    target_dout = dout.new(*dout.size())\n    target_dout.data.uniform_(-0.01, 0.01)\n    e = (target_out - out).mean()\n    e.backward()\n    e = (target_dout - dout).mean()\n    e.backward()\n\n\ndef example_mdpooling():\n    input = torch.randn(2, 32, 64, 64).cuda()\n    input.requires_grad = True\n    batch_inds = torch.randint(2, (20, 1)).cuda().float()\n    x = torch.randint(256, (20, 1)).cuda().float()\n    y = torch.randint(256, (20, 1)).cuda().float()\n    w = torch.randint(64, (20, 1)).cuda().float()\n    h = torch.randint(64, (20, 1)).cuda().float()\n    rois = torch.cat((batch_inds, x, y, x + w, y + h), dim=1)\n\n    # modulated deformable pooling (V2)\n    dpooling = DCNPooling(\n        spatial_scale=1.0 / 4,\n        pooled_size=7,\n        output_dim=32,\n        no_trans=False,\n        group_size=1,\n        trans_std=0.1,\n        deform_fc_dim=1024,\n    ).cuda()\n\n    dout = dpooling(input, rois)\n    target = dout.new(*dout.size())\n    target.data.uniform_(-0.1, 0.1)\n    error = (target - dout).mean()\n    error.backward()\n    print(dout.shape)\n\n\nif __name__ == \"__main__\":\n\n    example_dconv()\n    example_dpooling()\n    example_mdpooling()\n\n    check_pooling_zero_offset()\n    # zero offset check\n    if inC == outC:\n        check_zero_offset()\n\n    check_gradient_dpooling()\n    
check_gradient_dconv()\n    # \"\"\"\n    # ****** Note: the backward-is-not-reentrant error may not be a serious problem,\n    # ****** since the max error is less than 1e-7;\n    # ****** still looking for what triggers this problem.\n    # \"\"\"\n"
  },
  {
    "path": "LICENSE",
    "content": "                                 Apache License\n                           Version 2.0, January 2004\n                        http://www.apache.org/licenses/\n\n   TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION\n\n   1. Definitions.\n\n      \"License\" shall mean the terms and conditions for use, reproduction,\n      and distribution as defined by Sections 1 through 9 of this document.\n\n      \"Licensor\" shall mean the copyright owner or entity authorized by\n      the copyright owner that is granting the License.\n\n      \"Legal Entity\" shall mean the union of the acting entity and all\n      other entities that control, are controlled by, or are under common\n      control with that entity. For the purposes of this definition,\n      \"control\" means (i) the power, direct or indirect, to cause the\n      direction or management of such entity, whether by contract or\n      otherwise, or (ii) ownership of fifty percent (50%) or more of the\n      outstanding shares, or (iii) beneficial ownership of such entity.\n\n      \"You\" (or \"Your\") shall mean an individual or Legal Entity\n      exercising permissions granted by this License.\n\n      \"Source\" form shall mean the preferred form for making modifications,\n      including but not limited to software source code, documentation\n      source, and configuration files.\n\n      \"Object\" form shall mean any form resulting from mechanical\n      transformation or translation of a Source form, including but\n      not limited to compiled object code, generated documentation,\n      and conversions to other media types.\n\n      \"Work\" shall mean the work of authorship, whether in Source or\n      Object form, made available under the License, as indicated by a\n      copyright notice that is included in or attached to the work\n      (an example is provided in the Appendix below).\n\n      \"Derivative Works\" shall mean any work, whether in Source or Object\n      
form, that is based on (or derived from) the Work and for which the\n      editorial revisions, annotations, elaborations, or other modifications\n      represent, as a whole, an original work of authorship. For the purposes\n      of this License, Derivative Works shall not include works that remain\n      separable from, or merely link (or bind by name) to the interfaces of,\n      the Work and Derivative Works thereof.\n\n      \"Contribution\" shall mean any work of authorship, including\n      the original version of the Work and any modifications or additions\n      to that Work or Derivative Works thereof, that is intentionally\n      submitted to Licensor for inclusion in the Work by the copyright owner\n      or by an individual or Legal Entity authorized to submit on behalf of\n      the copyright owner. For the purposes of this definition, \"submitted\"\n      means any form of electronic, verbal, or written communication sent\n      to the Licensor or its representatives, including but not limited to\n      communication on electronic mailing lists, source code control systems,\n      and issue tracking systems that are managed by, or on behalf of, the\n      Licensor for the purpose of discussing and improving the Work, but\n      excluding communication that is conspicuously marked or otherwise\n      designated in writing by the copyright owner as \"Not a Contribution.\"\n\n      \"Contributor\" shall mean Licensor and any individual or Legal Entity\n      on behalf of whom a Contribution has been received by Licensor and\n      subsequently incorporated within the Work.\n\n   2. Grant of Copyright License. 
Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      copyright license to reproduce, prepare Derivative Works of,\n      publicly display, publicly perform, sublicense, and distribute the\n      Work and such Derivative Works in Source or Object form.\n\n   3. Grant of Patent License. Subject to the terms and conditions of\n      this License, each Contributor hereby grants to You a perpetual,\n      worldwide, non-exclusive, no-charge, royalty-free, irrevocable\n      (except as stated in this section) patent license to make, have made,\n      use, offer to sell, sell, import, and otherwise transfer the Work,\n      where such license applies only to those patent claims licensable\n      by such Contributor that are necessarily infringed by their\n      Contribution(s) alone or by combination of their Contribution(s)\n      with the Work to which such Contribution(s) was submitted. If You\n      institute patent litigation against any entity (including a\n      cross-claim or counterclaim in a lawsuit) alleging that the Work\n      or a Contribution incorporated within the Work constitutes direct\n      or contributory patent infringement, then any patent licenses\n      granted to You under this License for that Work shall terminate\n      as of the date such litigation is filed.\n\n   4. Redistribution. 
You may reproduce and distribute copies of the\n      Work or Derivative Works thereof in any medium, with or without\n      modifications, and in Source or Object form, provided that You\n      meet the following conditions:\n\n      (a) You must give any other recipients of the Work or\n          Derivative Works a copy of this License; and\n\n      (b) You must cause any modified files to carry prominent notices\n          stating that You changed the files; and\n\n      (c) You must retain, in the Source form of any Derivative Works\n          that You distribute, all copyright, patent, trademark, and\n          attribution notices from the Source form of the Work,\n          excluding those notices that do not pertain to any part of\n          the Derivative Works; and\n\n      (d) If the Work includes a \"NOTICE\" text file as part of its\n          distribution, then any Derivative Works that You distribute must\n          include a readable copy of the attribution notices contained\n          within such NOTICE file, excluding those notices that do not\n          pertain to any part of the Derivative Works, in at least one\n          of the following places: within a NOTICE text file distributed\n          as part of the Derivative Works; within the Source form or\n          documentation, if provided along with the Derivative Works; or,\n          within a display generated by the Derivative Works, if and\n          wherever such third-party notices normally appear. The contents\n          of the NOTICE file are for informational purposes only and\n          do not modify the License. 
You may add Your own attribution\n          notices within Derivative Works that You distribute, alongside\n          or as an addendum to the NOTICE text from the Work, provided\n          that such additional attribution notices cannot be construed\n          as modifying the License.\n\n      You may add Your own copyright statement to Your modifications and\n      may provide additional or different license terms and conditions\n      for use, reproduction, or distribution of Your modifications, or\n      for any such Derivative Works as a whole, provided Your use,\n      reproduction, and distribution of the Work otherwise complies with\n      the conditions stated in this License.\n\n   5. Submission of Contributions. Unless You explicitly state otherwise,\n      any Contribution intentionally submitted for inclusion in the Work\n      by You to the Licensor shall be under the terms and conditions of\n      this License, without any additional terms or conditions.\n      Notwithstanding the above, nothing herein shall supersede or modify\n      the terms of any separate license agreement you may have executed\n      with Licensor regarding such Contributions.\n\n   6. Trademarks. This License does not grant permission to use the trade\n      names, trademarks, service marks, or product names of the Licensor,\n      except as required for reasonable and customary use in describing the\n      origin of the Work and reproducing the content of the NOTICE file.\n\n   7. Disclaimer of Warranty. Unless required by applicable law or\n      agreed to in writing, Licensor provides the Work (and each\n      Contributor provides its Contributions) on an \"AS IS\" BASIS,\n      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or\n      implied, including, without limitation, any warranties or conditions\n      of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A\n      PARTICULAR PURPOSE. 
You are solely responsible for determining the\n      appropriateness of using or redistributing the Work and assume any\n      risks associated with Your exercise of permissions under this License.\n\n   8. Limitation of Liability. In no event and under no legal theory,\n      whether in tort (including negligence), contract, or otherwise,\n      unless required by applicable law (such as deliberate and grossly\n      negligent acts) or agreed to in writing, shall any Contributor be\n      liable to You for damages, including any direct, indirect, special,\n      incidental, or consequential damages of any character arising as a\n      result of this License or out of the use or inability to use the\n      Work (including but not limited to damages for loss of goodwill,\n      work stoppage, computer failure or malfunction, or any and all\n      other commercial damages or losses), even if such Contributor\n      has been advised of the possibility of such damages.\n\n   9. Accepting Warranty or Additional Liability. While redistributing\n      the Work or Derivative Works thereof, You may choose to offer,\n      and charge a fee for, acceptance of support, warranty, indemnity,\n      or other liability obligations and/or rights consistent with this\n      License. However, in accepting such obligations, You may act only\n      on Your own behalf and on Your sole responsibility, not on behalf\n      of any other Contributor, and only if You agree to indemnify,\n      defend, and hold each Contributor harmless for any liability\n      incurred by, or claims asserted against, such Contributor by reason\n      of your accepting any such warranty or additional liability.\n\n   END OF TERMS AND CONDITIONS\n\n   APPENDIX: How to apply the Apache License to your work.\n\n      To apply the Apache License to your work, attach the following\n      boilerplate notice, with the fields enclosed by brackets \"[]\"\n      replaced with your own identifying information. 
(Don't include\n      the brackets!)  The text should be enclosed in the appropriate\n      comment syntax for the file format. We also recommend that a\n      file or class name and description of purpose be included on the\n      same \"printed page\" as the copyright notice for easier\n      identification within third-party archives.\n\n   Copyright [yyyy] [name of copyright owner]\n\n   Licensed under the Apache License, Version 2.0 (the \"License\");\n   you may not use this file except in compliance with the License.\n   You may obtain a copy of the License at\n\n       http://www.apache.org/licenses/LICENSE-2.0\n\n   Unless required by applicable law or agreed to in writing, software\n   distributed under the License is distributed on an \"AS IS\" BASIS,\n   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n   See the License for the specific language governing permissions and\n   limitations under the License.\n"
  },
  {
    "path": "README.md",
    "content": "# FaPN: Feature-aligned Pyramid Network for Dense Image Prediction [[arXiv]](https://arxiv.org/pdf/2108.07058.pdf) [[Project Page]](http://www.shihuahuang.cn/fapn/)\n\n```BibTex\n@inproceedings{\n  huang2021fapn,\n  title={{FaPN}: Feature-aligned Pyramid Network for Dense Image Prediction},\n  author={Shihua Huang and Zhichao Lu and Ran Cheng and Cheng He},\n  booktitle={International Conference on Computer Vision (ICCV)},\n  year={2021}\n}\n```\n## Overview\n\nFaPN vs. FPN           |  Before vs. After Alignment\n:-------------------------:|:-------------------------:\n<img width=\"380\" src=\"./assert/fpn_vs_fapn.png\"> |  <img width=\"400\" src=\"./assert/feat_vis.png\">\n\nThis project provides the official implementation for our ICCV2021 paper \n\"[FaPN: Feature-aligned Pyramid Network for Dense Image Prediction](https://arxiv.org/pdf/2108.07058.pdf)\" \nbased on [Detectron2](https://github.com/facebookresearch/detectron2). \nFaPN is a simple yet effective top-down pyramidal architecture to generate multi-scale features for dense image prediction.\nComprised of a feature alignment module (FAM) and a feature selection module (FSM), FaPN addresses the issue of feature alignment\nin  the original [FPN](https://arxiv.org/abs/1612.03144), leading to substaintial improvements on various dense prediction tasks, such as object detection, semantic, instance, panoptic segmentation, etc. 
\n\n\n## Installation\nThis project is based on [Detectron2](https://github.com/facebookresearch/detectron2) and can be set up as follows.\n* Install Detectron2 following [the instructions](https://detectron2.readthedocs.io/tutorials/install.html).\n* Set up the dataset following [the structure](https://github.com/facebookresearch/detectron2/blob/master/datasets/README.md).\n* Copy this project to `/path/to/detectron2`\n* Install DCNv2 following [Install DCNv2.md](./DCNv2/README.md).\n\n## Training\nTo train a model with 8 GPUs, run:\n```bash\ncd /path/to/detectron2/tools\npython3 train_net.py --config-file <config.yaml> --num-gpus 8\n```\n\nFor example, to launch Faster R-CNN training (1x schedule) with a ResNet-50 backbone on 8 GPUs,\nrun:\n```bash\ncd /path/to/detectron2/tools\npython3 train_net.py --config-file ../configs/COCO-Detection/faster_rcnn_R_50_FAN_1x.yaml --num-gpus 8\n```\n\n## Evaluation\nTo evaluate a pre-trained model with 8 GPUs, run:\n```bash\ncd /path/to/detectron2/tools\npython3 train_net.py --config-file <config.yaml> --num-gpus 8 --eval-only MODEL.WEIGHTS /path/to/model_checkpoint\n```\n\n## Results\n### COCO Object Detection\n#### Faster R-CNN + FaPN:\n<table><tbody>\n<!-- START TABLE -->\n<!-- TABLE HEADER -->\n<th valign=\"bottom\">Name</th>\n<th valign=\"bottom\">lr<br/>sched</th>\n<th valign=\"bottom\">box<br/>AP</th>\n<th valign=\"bottom\">box<br/>APs</th>\n<th valign=\"bottom\">box<br/>APm</th>\n<th valign=\"bottom\">box<br/>APl</th>\n<th valign=\"bottom\">download</th>\n<!-- TABLE BODY -->\n<!-- ROW: faster_rcnn_R_50_FAN_1x -->\n <tr><td align=\"left\"><a href=\"configs/COCO-Detection/faster_rcnn_R_50_FAN_1x.yaml\">R50</a></td>\n<td align=\"center\">1x</td>\n<td align=\"center\">39.2</td>\n<td align=\"center\">24.5</td>\n<td align=\"center\">43.3</td>\n<td align=\"center\">49.1</td>\n<td align=\"center\"><a 
href=\"https://drive.google.com/file/d/16bws3mM-itTMBZvbBoBaJIm8bW7jLrTl/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/1cP0JJ98zNbqXDfx2g12qEF3i9wqyxzet/view?usp=sharing\">log</a></td>\n</tr>\n <tr><td align=\"left\"><a href=\"configs/COCO-Detection/faster_rcnn_R_101_FAN_3x.yaml\">R101</a></td>\n<td align=\"center\">3x</td>\n<td align=\"center\">42.8</td>\n<td align=\"center\">27.0</td>\n<td align=\"center\">46.2</td>\n<td align=\"center\">54.9</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1KioARI3Be2LPG1MdIgiQeAL_KIlRXhNP/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/1a_8yvjIbV_uaNYKsN9sPhblcceHHG7SC/view?usp=sharing\">log</a></td>\n</tr>\n</tbody></table>\n\n### Cityscapes Semantic Segmentation\n#### PointRend + FaPN:\n<table><tbody>\n<!-- START TABLE -->\n<!-- TABLE HEADER -->\n<th valign=\"bottom\">Name</th>\n<th valign=\"bottom\">lr<br/>sched</th>\n<th valign=\"bottom\">mask<br/>mIoU</th>\n<th valign=\"bottom\">mask<br/>i_IoU</th>\n<th valign=\"bottom\">mask<br/>IoU_sup</th>\n<th valign=\"bottom\">mask<br/>iIoU_sup</th>\n<th valign=\"bottom\">download</th>\n<!-- TABLE BODY -->\n<!-- ROW: faster_rcnn_R_50_FAN_1x -->\n <tr><td align=\"left\"><a href=\"./projects/PointRend/configs/SemanticSegmentation/pointrend_semantic_R_50_FAN_1x_cityscapes.yaml\">R50</a></td>\n<td align=\"center\">1x</td>\n<td align=\"center\">80.0</td>\n<td align=\"center\">61.3</td>\n<td align=\"center\">90.6</td>\n<td align=\"center\">78.5</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1R6af03eqnUufmYl7cf-eixbI_En8WN-8/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/1i7p9RLLF_CpHNxcY5WwlKYY8h9ANGdEs/view?usp=sharing\">log</a></td>\n</tr>\n <tr><td align=\"left\"><a href=\"./projects/PointRend/configs/SemanticSegmentation/pointrend_semantic_R_101_FAN_1x_cityscapes.yaml\">R101</a></td>\n<td align=\"center\">1x</td>\n<td 
align=\"center\">80.1</td>\n<td align=\"center\">62.2</td>\n<td align=\"center\">90.8</td>\n<td align=\"center\">78.6</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1JSg9hweCIYZOhSceZAeF6CcbqIAiLKfr/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/1M_MUJXNbiHYlN2D9m1kxfM1KXGb2C3E0/view?usp=sharing\">log</a></td>\n</tr>\n</tbody></table>\n\n### COCO Instance Segmentation\n#### Mask R-CNN + FaPN:\n<table><tbody>\n<!-- START TABLE -->\n<!-- TABLE HEADER -->\n<th valign=\"bottom\">Name</th>\n<th valign=\"bottom\">lr<br/>sched</th>\n<th valign=\"bottom\">mask<br/>AP</th>\n<th valign=\"bottom\">mask<br/>APs</th>\n<th valign=\"bottom\">box<br/>AP</th>\n<th valign=\"bottom\">box<br/>APs</th>\n<th valign=\"bottom\">download</th>\n<!-- TABLE BODY -->\n <tr><td align=\"left\"><a href=\"./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FAN_1x.yaml\">R50</a></td>\n<td align=\"center\">1x</td>\n<td align=\"center\">36.4</td>\n<td align=\"center\">18.1</td>\n<td align=\"center\">39.8</td>\n<td align=\"center\">24.3</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1fNQw3v2d6C9BI3UF34iRqaWp2W48-Hl4/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/1BC2Fgex5s7biuTeBM0WpTJul_FyIdObq/view?usp=sharing\">log</a></td>\n</tr>\n <tr><td align=\"left\"><a href=\"./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FAN_1x.yaml\">R101</a></td>\n<td align=\"center\">3x</td>\n<td align=\"center\">39.4</td>\n<td align=\"center\">20.9</td>\n<td align=\"center\">43.8</td>\n<td align=\"center\">27.4</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1MMWu_Bj_nrgiXwACJArcUR5G0iKmVKRA/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/1QzN5_4ylskbTv4aTbMEJ1pNcK14zGQ2u/view?usp=sharing\">log</a></td>\n</tr>\n</tbody></table>\n\n#### PointRend + FaPN:\n<table><tbody>\n<!-- START TABLE -->\n<!-- TABLE HEADER -->\n<th 
valign=\"bottom\">Name</th>\n<th valign=\"bottom\">lr<br/>sched</th>\n<th valign=\"bottom\">mask<br/>AP</th>\n<th valign=\"bottom\">mask<br/>APs</th>\n<th valign=\"bottom\">box<br/>AP</th>\n<th valign=\"bottom\">box<br/>APs</th>\n<th valign=\"bottom\">download</th>\n<!-- TABLE BODY -->\n <tr><td align=\"left\"><a href=\"./projects/PointRend/configs/SemanticSegmentation/pointrend_semantic_R_101_FAN_1x_cityscapes.yaml\">R50</a></td>\n<td align=\"center\">1x</td>\n<td align=\"center\">37.6</td>\n<td align=\"center\">18.6</td>\n<td align=\"center\">39.4</td>\n<td align=\"center\">24.2</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1EHTQJ4F2RdPBiXno97SJyP2FDZz-roCY/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/1AqznSsh6Srfh0IHHJkXD1opFXU5TQ3_-/view?usp=sharing\">log</a></td>\n</tr>\n</tbody></table>\n\n\n### COCO Panoptic Segmentation\n#### PanopticFPN + FaPN:\n<table><tbody>\n<!-- START TABLE -->\n<!-- TABLE HEADER -->\n<th valign=\"bottom\">Name</th>\n<th valign=\"bottom\">lr<br/>sched</th>\n<th valign=\"bottom\">PQ</th>\n<th valign=\"bottom\">mask<br/>mIoU</th>\n<th valign=\"bottom\">St<br/>PQ</th>\n<th valign=\"bottom\">box<br/>AP</th>\n<th valign=\"bottom\">Th<br/>PQ</th>\n<th valign=\"bottom\">download</th>\n<!-- TABLE BODY -->\n <tr><td align=\"left\"><a href=\"./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FAN_1x.yaml\">R50</a></td>\n<td align=\"center\">1x</td>\n<td align=\"center\">41.1</td>\n<td align=\"center\">43.4</td>\n<td align=\"center\">32.5</td>\n<td align=\"center\">38.7</td>\n<td align=\"center\">46.9</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1XNhvGGbfxTz_kU3VSjLQ5jrKQn_a_4dE/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/1AqPRCn7dD9MQR3GX06tvT-oPn6E7giJM/view?usp=sharing\">log</a></td>\n</tr>\n <tr><td align=\"left\"><a href=\"./configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FAN_1x.yaml\">R101</a></td>\n<td 
align=\"center\">3x</td>\n<td align=\"center\">44.2</td>\n<td align=\"center\">45.7</td>\n<td align=\"center\">35.0</td>\n<td align=\"center\">43.0</td>\n<td align=\"center\">53.3</td>\n<td align=\"center\"><a href=\"https://drive.google.com/file/d/1buNmJEETxZmAnjhZCz4WqF5pSc9ezPow/view?usp=sharing\">model</a>&nbsp;|&nbsp;\n<a href=\"https://drive.google.com/file/d/106WqJEdRbbuKQa2eZW8Zwf3ucgARkz7K/view?usp=sharing\">log</a></td>\n</tr>\n</tbody></table>\n"
  },
  {
    "path": "configs/Base-RCNN-FAN.yaml",
    "content": "MODEL:\n  META_ARCHITECTURE: \"GeneralizedRCNN\"\n  BACKBONE:\n    NAME: \"build_resnet_fan_backbone\"   # build_resnet_fan_backbone\n  RESNETS:\n    OUT_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n  FPN:\n    IN_FEATURES: [\"res2\", \"res3\", \"res4\", \"res5\"]\n  ANCHOR_GENERATOR:\n    SIZES: [[32], [64], [128], [256], [512]]  # One size for each in feature map\n    ASPECT_RATIOS: [[0.5, 1.0, 2.0]]  # Three aspect ratios (same for all in feature maps)\n  RPN:\n    IN_FEATURES: [\"p2\", \"p3\", \"p4\", \"p5\", \"p6\"]\n    PRE_NMS_TOPK_TRAIN: 2000  # Per FPN level\n    PRE_NMS_TOPK_TEST: 1000  # Per FPN level\n    # Detectron1 uses 2000 proposals per-batch,\n    # (See \"modeling/rpn/rpn_outputs.py\" for details of this legacy issue)\n    # which is approximately 1000 proposals per-image since the default batch size for FPN is 2.\n    POST_NMS_TOPK_TRAIN: 1000\n    POST_NMS_TOPK_TEST: 1000\n  ROI_HEADS:\n    NAME: \"StandardROIHeads\"\n    IN_FEATURES: [\"p2\", \"p3\", \"p4\", \"p5\"]\n  ROI_BOX_HEAD:\n    NAME: \"FastRCNNConvFCHead\"\n    NUM_FC: 2\n    POOLER_RESOLUTION: 7\n  ROI_MASK_HEAD:\n    NAME: \"MaskRCNNConvUpsampleHead\"\n    NUM_CONV: 4\n    POOLER_RESOLUTION: 14\nDATASETS:\n  TRAIN: (\"coco_2017_train\",)\n  TEST: (\"coco_2017_val\",)\nSOLVER:\n  IMS_PER_BATCH: 16\n  BASE_LR: 0.02\n  STEPS: (60000, 80000)\n  MAX_ITER: 90000\nINPUT:\n  MIN_SIZE_TRAIN: (640, 672, 704, 736, 768, 800)\nVERSION: 2\n"
  },
  {
    "path": "configs/COCO-Detection/faster_rcnn_R_101_FAN_3x.yaml",
    "content": "_BASE_: \"../Base-RCNN-FAN.yaml\"\nMODEL:\n  WEIGHTS: \"detectron2://ImageNetPretrained/MSRA/R-101.pkl\"\n#  WEIGHTS: \"path/faster_rcnn_r101_3x_fan/model_final.pth\"\n  MASK_ON: False\n  RESNETS:\n    DEPTH: 101\nSOLVER:\n  STEPS: (210000, 250000)\n  MAX_ITER: 270000"
  },
  {
    "path": "configs/COCO-Detection/faster_rcnn_R_50_FAN_1x.yaml",
    "content": "_BASE_: \"../Base-RCNN-FAN.yaml\"\nMODEL:\n  WEIGHTS: \"detectron2://ImageNetPretrained/MSRA/R-50.pkl\"\n#  WEIGHTS: \"path/faster_rcnn_r50_1x_fan/model_final.pth\"\n  MASK_ON: False\n  RESNETS:\n    DEPTH: 50\n"
  },
  {
    "path": "configs/COCO-InstanceSegmentation/mask_rcnn_R_101_FAN_3x.yaml",
    "content": "_BASE_: \"../Base-RCNN-FAN.yaml\"\nMODEL:\n  WEIGHTS: \"detectron2://ImageNetPretrained/MSRA/R-101.pkl\"\n#  WEIGHTS: \"path/mask_rcnn_r101_3x_fan/model_final.pth\"\n  MASK_ON: True\n  RESNETS:\n    DEPTH: 101\n    \nSOLVER:\n  STEPS: (210000, 250000)\n  MAX_ITER: 270000"
  },
  {
    "path": "configs/COCO-InstanceSegmentation/mask_rcnn_R_50_FAN_1x.yaml",
    "content": "_BASE_: \"../Base-RCNN-FAN.yaml\"\nMODEL:\n  WEIGHTS: \"detectron2://ImageNetPretrained/MSRA/R-50.pkl\"\n#  WEIGHTS: \"/home/cseadmin/huangsh/codes/detectron2/tools/mask_rcnn_r50_1x_fan/model_final.pth\"\n  MASK_ON: True\n  RESNETS:\n    DEPTH: 50"
  },
  {
    "path": "configs/COCO-PanopticSegmentation/Base-Panoptic-FAN.yaml",
    "content": "_BASE_: \"../Base-RCNN-FAN.yaml\"\nMODEL:\n  META_ARCHITECTURE: \"PanopticFPN\"\n  MASK_ON: True\n  SEM_SEG_HEAD:\n    LOSS_WEIGHT: 0.5\nDATASETS:\n  TRAIN: (\"coco_2017_train_panoptic_separated\",)\n  TEST: (\"coco_2017_val_panoptic_separated\",)\nDATALOADER:\n  FILTER_EMPTY_ANNOTATIONS: False\n"
  },
  {
    "path": "configs/COCO-PanopticSegmentation/panoptic_fan_R_101_3x.yaml",
    "content": "_BASE_: \"Base-Panoptic-FAN.yaml\"\nMODEL:\n  WEIGHTS: \"detectron2://ImageNetPretrained/MSRA/R-101.pkl\"\n#  WEIGHTS: \"path/panoptic_r101_3x_fan/model_final.pth\"\n  RESNETS:\n    DEPTH: 101\nSOLVER:\n  STEPS: (210000, 250000)\n  MAX_ITER: 270000"
  },
  {
    "path": "configs/COCO-PanopticSegmentation/panoptic_fan_R_50_1x.yaml",
    "content": "_BASE_: \"Base-Panoptic-FAN.yaml\"\nMODEL:\n  WEIGHTS: \"detectron2://ImageNetPretrained/MSRA/R-50.pkl\"\n#  WEIGHTS: \"path/panoptic_r50_1x_fan/model_final.pth\"\n  RESNETS:\n    DEPTH: 50"
  },
  {
    "path": "detectron2/modeling/backbone/__init__.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates.\nfrom .build import build_backbone, BACKBONE_REGISTRY  # noqa F401 isort:skip\n\nfrom .backbone import Backbone\nfrom .fpn import FPN\nfrom .regnet import RegNet\nfrom .fan import FAN\nfrom .resnet import (\n    BasicStem,\n    ResNet,\n    ResNetBlockBase,\n    build_resnet_backbone,\n    make_stage,\n    BottleneckBlock,\n)\n\n__all__ = [k for k in globals().keys() if not k.startswith(\"_\")]\n# TODO can expose more resnet blocks after careful consideration\n"
  },
  {
    "path": "detectron2/modeling/backbone/fan.py",
    "content": "# Copyright (c) Facebook, Inc. and its affiliates. All Rights Reserved\nimport math\nimport fvcore.nn.weight_init as weight_init\nimport torch.nn.functional as F\nimport torch\nfrom torch import nn\nimport os\nimport torch\nimport torchvision as tv\nimport torchvision.transforms as transforms\nimport torch.nn as nn\nimport numpy as np\nimport cv2\nimport PIL\nfrom PIL import Image\nimport matplotlib.pyplot as plt\nimport matplotlib.cm as mpl_color_map\n\nfrom detectron2.layers import Conv2d, ShapeSpec, get_norm\n\nfrom .backbone import Backbone\nfrom .build import BACKBONE_REGISTRY\nfrom .resnet import build_resnet_backbone\n\n__all__ = [\"build_resnet_fan_backbone\", \"build_retinanet_resnet_fan_backbone\", \"FAN\"]\n\n# from dcn_v2 import DCN, DCNPooling, DCNv2, DCNv2Pooling, dcn_v2_conv, dcn_v2_pooling\nfrom dcn_v2 import DCN as dcn_v2\nfrom detectron2.layers import (CNNBlockBase, Conv2d, DeformConv, ModulatedDeformConv, ShapeSpec, get_norm, )\n\n\nclass FeatureSelectionModule(nn.Module):\n    def __init__(self, in_chan, out_chan, norm=\"GN\"):\n        super(FeatureSelectionModule, self).__init__()\n        self.conv_atten = Conv2d(in_chan, in_chan, kernel_size=1, bias=False, norm=get_norm(norm, in_chan))\n        self.sigmoid = nn.Sigmoid()\n        self.conv = Conv2d(in_chan, out_chan, kernel_size=1, bias=False, norm=get_norm('', out_chan))\n        weight_init.c2_xavier_fill(self.conv_atten)\n        weight_init.c2_xavier_fill(self.conv)\n\n    def forward(self, x):\n        atten = self.sigmoid(self.conv_atten(F.avg_pool2d(x, x.size()[2:])))\n        feat = torch.mul(x, atten)\n        x = x + feat\n        feat = self.conv(x)\n        return feat\n\n\nclass FeatureAlign_V2(nn.Module):  # FaPN full version\n    def __init__(self, in_nc=128, out_nc=128, norm=None):\n        super(FeatureAlign_V2, self).__init__()\n        self.lateral_conv = FeatureSelectionModule(in_nc, out_nc, norm=\"\")\n        self.offset = Conv2d(out_nc * 2, out_nc, 
kernel_size=1, stride=1, padding=0, bias=False, norm=norm)\n        self.dcpack_L2 = dcn_v2(out_nc, out_nc, 3, stride=1, padding=1, dilation=1, deformable_groups=8,\n                                extra_offset_mask=True)\n        self.relu = nn.ReLU(inplace=True)\n        weight_init.c2_xavier_fill(self.offset)\n\n    def forward(self, feat_l, feat_s, main_path=None):\n        HW = feat_l.size()[2:]\n        if feat_l.size()[2:] != feat_s.size()[2:]:\n            feat_up = F.interpolate(feat_s, HW, mode='bilinear', align_corners=False)\n        else:\n            feat_up = feat_s\n        feat_arm = self.lateral_conv(feat_l)  # lateral features rescaled by sigmoid attention in [0, 1]\n        offset = self.offset(torch.cat([feat_arm, feat_up * 2], dim=1))  # concatenate to compute offsets from the feature difference\n        feat_align = self.relu(self.dcpack_L2([feat_up, offset], main_path))  # [feat, offset]\n        return feat_align + feat_arm\n\n\nclass FAN(Backbone):\n    \"\"\"\n    This module implements a feature-aligned pyramid network (FaPN) variant of :paper:`FPN`.\n    It creates pyramid features built on top of some input feature maps.\n    \"\"\"\n\n    def __init__(self, bottom_up, in_features, out_channels, norm=\"\", top_block=None, fuse_type=\"sum\"):\n        \"\"\"\n        Args:\n            bottom_up (Backbone): module representing the bottom up subnetwork.\n                Must be a subclass of :class:`Backbone`. The multi-scale feature\n                maps generated by the bottom up network, and listed in `in_features`,\n                are used to generate FPN levels.\n            in_features (list[str]): names of the input feature maps coming\n                from the backbone to which FPN is attached. 
For example, if the\n                backbone produces [\"res2\", \"res3\", \"res4\"], any *contiguous* sublist\n                of these may be used; order must be from high to low resolution.\n            out_channels (int): number of channels in the output feature maps.\n            norm (str): the normalization to use.\n            top_block (nn.Module or None): if provided, an extra operation will\n                be performed on the output of the last (smallest resolution)\n                FPN output, and the result will extend the result list. The top_block\n                further downsamples the feature map. It must have an attribute\n                \"num_levels\", meaning the number of extra FPN levels added by\n                this block, and \"in_feature\", which is a string representing\n                its input feature (e.g., p5).\n            fuse_type (str): types for fusing the top down features and the lateral\n                ones. It can be \"sum\" (default), which sums up element-wise; or \"avg\",\n                which takes the element-wise mean of the two.\n        \"\"\"\n        super(FAN, self).__init__()\n        assert isinstance(bottom_up, Backbone)\n\n        # Feature map strides and channels from the bottom up network (e.g. 
ResNet)\n        input_shapes = bottom_up.output_shape()\n        strides = [input_shapes[f].stride for f in in_features]\n        in_channels_per_feature = [input_shapes[f].channels for f in in_features]\n\n        _assert_strides_are_log2_contiguous(strides)\n        align_modules = []\n        output_convs = []\n\n        use_bias = norm == \"\"\n        for idx, in_channels in enumerate(in_channels_per_feature[:-1]):\n            stage = int(math.log2(strides[idx]))\n            lateral_norm = get_norm(norm, out_channels)\n            align_module = FeatureAlign_V2(in_channels, out_channels, norm=lateral_norm)  # proposed fapn\n            self.add_module(\"fan_align{}\".format(stage), align_module)\n            align_modules.append(align_module)\n            output_conv = Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=use_bias,\n                                 norm=get_norm(norm, out_channels), )\n            weight_init.c2_xavier_fill(output_conv)\n            self.add_module(\"fpn_output{}\".format(stage), output_conv)\n            output_convs.append(output_conv)\n        stage = int(math.log2(strides[len(in_channels_per_feature) - 1]))\n        lateral_conv = Conv2d(in_channels_per_feature[-1], out_channels, kernel_size=1, bias=use_bias,\n                              norm=get_norm(norm, out_channels))\n        align_modules.append(lateral_conv)\n        self.add_module(\"fan_align{}\".format(stage), lateral_conv)\n        # Place convs into top-down order (from low to high resolution) to make the top-down computation in forward clearer.\n        self.align_modules = align_modules[::-1]\n        self.output_convs = output_convs[::-1]\n        self.top_block = top_block\n        self.in_features = in_features\n        self.bottom_up = bottom_up\n        # Return feature names are \"p<stage>\", like [\"p2\", \"p3\", ..., \"p6\"]\n        self._out_feature_strides = {\"p{}\".format(int(math.log2(s))): s for s in strides}\n        
# top block output feature maps.\n        if self.top_block is not None:\n            for s in range(stage, stage + self.top_block.num_levels):\n                self._out_feature_strides[\"p{}\".format(s + 1)] = 2 ** (s + 1)\n        self._out_features = list(self._out_feature_strides.keys())\n        self._out_feature_channels = {k: out_channels for k in self._out_features}\n        self._size_divisibility = strides[-1]\n        assert fuse_type in {\"avg\", \"sum\"}\n        self._fuse_type = fuse_type\n\n    @property\n    def size_divisibility(self):\n        return self._size_divisibility\n\n    def forward(self, x):\n        \"\"\"\n        Args:\n            x (Tensor): the network input, passed to the bottom-up backbone, which\n                produces a dict mapping feature map names (e.g., \"res5\") to feature\n                tensors in high to low resolution order.\n\n        Returns:\n            dict[str->Tensor]:\n                mapping from feature map name to FPN feature map tensor\n                in high to low resolution order. 
Returned feature names follow the FPN\n                paper convention: \"p<stage>\", where level \"p<stage>\" has stride = 2 ** stage, e.g.,\n                [\"p2\", \"p3\", ..., \"p6\"].\n        \"\"\"\n        # Reverse feature maps into top-down order (from low to high resolution)\n        bottom_up_features = self.bottom_up(x)\n        x = [bottom_up_features[f] for f in self.in_features[::-1]]\n        results = []\n        prev_features = self.align_modules[0](x[0])\n        results.append(prev_features)\n        for features, align_module, output_conv in zip(x[1:], self.align_modules[1:], self.output_convs):\n            prev_features = align_module(features, prev_features)\n            results.insert(0, output_conv(prev_features))\n\n        if self.top_block is not None:\n            top_block_in_feature = bottom_up_features.get(self.top_block.in_feature, None)\n            if top_block_in_feature is None:\n                top_block_in_feature = results[self._out_features.index(self.top_block.in_feature)]\n            results.extend(self.top_block(top_block_in_feature))\n        assert len(self._out_features) == len(results)\n        return dict(zip(self._out_features, results))\n\n    def output_shape(self):\n        return {name: ShapeSpec(channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]) for\n                name in self._out_features}\n\n\ndef _assert_strides_are_log2_contiguous(strides):\n    \"\"\"\n    Assert that each stride is twice its preceding stride, i.e. 
\"contiguous in log2\".\n    \"\"\"\n    for i, stride in enumerate(strides[1:], 1):\n        assert stride == 2 * strides[i - 1], \"Strides {} {} are not log2 contiguous\".format(stride, strides[i - 1])\n\n\nclass LastLevelMaxPool(nn.Module):\n    \"\"\"\n    This module is used in the original FPN to generate a downsampled\n    P6 feature from P5.\n    \"\"\"\n\n    def __init__(self):\n        super().__init__()\n        self.num_levels = 1\n        self.in_feature = \"p5\"\n\n    def forward(self, x):\n        return [F.max_pool2d(x, kernel_size=1, stride=2, padding=0)]\n\n\nclass LastLevelP6P7(nn.Module):\n    \"\"\"\n    This module is used in RetinaNet to generate extra layers, P6 and P7 from\n    C5 feature.\n    \"\"\"\n\n    def __init__(self, in_channels, out_channels, in_feature=\"res5\"):\n        super().__init__()\n        self.num_levels = 2\n        self.in_feature = in_feature\n        self.p6 = nn.Conv2d(in_channels, out_channels, 3, 2, 1)\n        self.p7 = nn.Conv2d(out_channels, out_channels, 3, 2, 1)\n        for module in [self.p6, self.p7]:\n            weight_init.c2_xavier_fill(module)\n\n    def forward(self, c5):\n        p6 = self.p6(c5)\n        p7 = self.p7(F.relu(p6))\n        return [p6, p7]\n\n\n@BACKBONE_REGISTRY.register()\ndef build_resnet_fan_backbone(cfg, input_shape: ShapeSpec):\n    \"\"\"\n    Args:\n        cfg: a detectron2 CfgNode\n\n    Returns:\n        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.\n    \"\"\"\n    bottom_up = build_resnet_backbone(cfg, input_shape)\n    in_features = cfg.MODEL.FPN.IN_FEATURES\n    out_channels = cfg.MODEL.FPN.OUT_CHANNELS\n    backbone = FAN(bottom_up=bottom_up, in_features=in_features, out_channels=out_channels,\n                   norm=cfg.MODEL.FPN.NORM, top_block=LastLevelMaxPool(), fuse_type=cfg.MODEL.FPN.FUSE_TYPE,\n                   )\n    return backbone\n\n\n@BACKBONE_REGISTRY.register()\ndef build_retinanet_resnet_fan_backbone(cfg, 
input_shape: ShapeSpec):\n    \"\"\"\n    Args:\n        cfg: a detectron2 CfgNode\n\n    Returns:\n        backbone (Backbone): backbone module, must be a subclass of :class:`Backbone`.\n    \"\"\"\n    bottom_up = build_resnet_backbone(cfg, input_shape)\n    in_features = cfg.MODEL.FPN.IN_FEATURES\n    out_channels = cfg.MODEL.FPN.OUT_CHANNELS\n    in_channels_p6p7 = bottom_up.output_shape()[\"res5\"].channels\n    backbone = FAN(bottom_up=bottom_up, in_features=in_features, out_channels=out_channels, norm=cfg.MODEL.FPN.NORM,\n                   top_block=LastLevelP6P7(in_channels_p6p7, out_channels), fuse_type=cfg.MODEL.FPN.FUSE_TYPE, )\n    return backbone\n"
  },
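The `FAN` backbone in `fan.py` above names its outputs from the input strides: each level becomes `"p<stage>"` with `stage = log2(stride)`, strides must be log2-contiguous, and a top block such as `LastLevelMaxPool` appends one extra level. A torch-free sketch of that convention (the helper names below are illustrative, not part of the repo):

```python
import math


def assert_strides_log2_contiguous(strides):
    """Mirror of _assert_strides_are_log2_contiguous: each stride must be
    exactly twice its predecessor, e.g. [4, 8, 16, 32]."""
    for prev, cur in zip(strides, strides[1:]):
        assert cur == 2 * prev, "Strides {} {} are not log2 contiguous".format(cur, prev)


def pyramid_level_names(strides, extra_levels=1):
    """Name each level "p<stage>" with stage = log2(stride); extra_levels
    mimics a top block (e.g. LastLevelMaxPool adding p6)."""
    assert_strides_log2_contiguous(strides)
    names = ["p{}".format(int(math.log2(s))) for s in strides]
    last = int(math.log2(strides[-1]))
    names += ["p{}".format(last + i + 1) for i in range(extra_levels)]
    return names
```

With the standard ResNet strides `[4, 8, 16, 32]` this yields `["p2", "p3", "p4", "p5", "p6"]`, which is why the PointRend configs list `IN_FEATURES: ["p2", "p3", "p4", "p5"]`.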
  {
    "path": "projects/PointRend/configs/InstanceSegmentation/Base-PointRend-RCNN-FAN.yaml",
    "content": "_BASE_: \"../../../../configs/Base-RCNN-FAN.yaml\"\nMODEL:\n  MASK_ON: true\n  ROI_HEADS:\n    NAME: \"PointRendROIHeads\"\n    IN_FEATURES: [\"p2\", \"p3\", \"p4\", \"p5\"]\n  ROI_BOX_HEAD:\n    TRAIN_ON_PRED_BOXES: True\n  ROI_MASK_HEAD:\n    NAME: \"CoarseMaskHead\"\n    FC_DIM: 1024\n    NUM_FC: 2\n    OUTPUT_SIDE_RESOLUTION: 7\n    IN_FEATURES: [\"p2\"]\n    POINT_HEAD_ON: True\n  POINT_HEAD:\n    FC_DIM: 256\n    NUM_FC: 3\n    IN_FEATURES: [\"p2\"]\nINPUT:\n  # PointRend for instance segmenation does not work with \"polygon\" mask_format.\n  MASK_FORMAT: \"bitmask\"\n\n#DATASETS:\n#  TEST: (\"lvis_v1_val_cocofied\",)"
  },
  {
    "path": "projects/PointRend/configs/InstanceSegmentation/pointrend_rcnn_R_50_FAN_1x_coco.yaml",
    "content": "_BASE_: Base-PointRend-RCNN-FAN.yaml\nMODEL:\n  WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl\n#  WEIGHTS: \"path/pointrend_coco_r50_1x_fan/model_final.pth\"\n  RESNETS:\n    DEPTH: 50\n# To add COCO AP evaluation against the higher-quality LVIS annotations.\n# DATASETS:\n#   TEST: (\"coco_2017_val\", \"lvis_v0.5_val_cocofied\")"
  },
  {
    "path": "projects/PointRend/configs/SemanticSegmentation/Base-PointRend-Semantic-FAN.yaml",
    "content": "_BASE_: \"../../../../configs/Base-RCNN-FAN.yaml\"\nMODEL:\n  META_ARCHITECTURE: \"SemanticSegmentor\"\n  BACKBONE:\n    FREEZE_AT: 0\n  SEM_SEG_HEAD:\n    NAME: \"PointRendSemSegHead\"\n  POINT_HEAD:\n    NUM_CLASSES: 54\n    FC_DIM: 256\n    NUM_FC: 3\n    IN_FEATURES: [\"p2\"]\n    TRAIN_NUM_POINTS: 1024\n    SUBDIVISION_STEPS: 2\n    SUBDIVISION_NUM_POINTS: 8192\n    COARSE_SEM_SEG_HEAD_NAME: \"SemSegFPNHead\"\n    COARSE_PRED_EACH_LAYER: False\nDATASETS:\n  TRAIN: (\"coco_2017_train_panoptic_stuffonly\",)\n  TEST: (\"coco_2017_val_panoptic_stuffonly\",)\n"
  },
  {
    "path": "projects/PointRend/configs/SemanticSegmentation/pointrend_semantic_R_101_FAN_1x_cityscapes.yaml",
    "content": "_BASE_: Base-PointRend-Semantic-FAN.yaml\nMODEL:\n  WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-101.pkl\n#  WEIGHTS: \"path/pointrend_cityscapes_r101_1x_fan/model_final.pth\"\n  RESNETS:\n    DEPTH: 101\n  SEM_SEG_HEAD:\n    NUM_CLASSES: 19\n  POINT_HEAD:\n    NUM_CLASSES: 19\n    TRAIN_NUM_POINTS: 2048\n    SUBDIVISION_NUM_POINTS: 8192\nDATASETS:\n  TRAIN: (\"cityscapes_fine_sem_seg_train\",)\n  TEST: (\"cityscapes_fine_sem_seg_val\",)\nSOLVER:\n  BASE_LR: 0.01\n  STEPS: (40000, 55000)\n  MAX_ITER: 65000\n  IMS_PER_BATCH: 32\nINPUT:\n  MIN_SIZE_TRAIN: (512, 768, 1024, 1280, 1536, 1792, 2048)\n  MIN_SIZE_TRAIN_SAMPLING: \"choice\"\n  MIN_SIZE_TEST: 1024\n  MAX_SIZE_TRAIN: 4096\n  MAX_SIZE_TEST: 2048\n  CROP:\n    ENABLED: True\n    TYPE: \"absolute\"\n    SIZE: (512, 1024)\n    SINGLE_CATEGORY_MAX_AREA: 0.75\n  COLOR_AUG_SSD: True\nDATALOADER:\n  NUM_WORKERS: 10\n\nOUTPUT_DIR: \"./pointrend_cityscapes_r101_1x_fan\"\n"
  },
  {
    "path": "projects/PointRend/configs/SemanticSegmentation/pointrend_semantic_R_50_FAN_1x_cityscapes.yaml",
    "content": "_BASE_: Base-PointRend-Semantic-FAN.yaml\nMODEL:\n  WEIGHTS: detectron2://ImageNetPretrained/MSRA/R-50.pkl\n#  WEIGHTS: \"path/pointrend_cityscapes_r50_1x_fan/model_final.pth\"\n  RESNETS:\n    DEPTH: 50\n  SEM_SEG_HEAD:\n    NUM_CLASSES: 19\n  POINT_HEAD:\n    NUM_CLASSES: 19\n    TRAIN_NUM_POINTS: 2048\n    SUBDIVISION_NUM_POINTS: 8192\nDATASETS:\n  TRAIN: (\"cityscapes_fine_sem_seg_train\",)\n  TEST: (\"cityscapes_fine_sem_seg_val\",)\nSOLVER:\n  BASE_LR: 0.01\n  STEPS: (40000, 55000)\n  MAX_ITER: 65000\n  IMS_PER_BATCH: 32\nINPUT:\n  MIN_SIZE_TRAIN: (512, 768, 1024, 1280, 1536, 1792, 2048)\n  MIN_SIZE_TRAIN_SAMPLING: \"choice\"\n  MIN_SIZE_TEST: 1024\n  MAX_SIZE_TRAIN: 4096\n  MAX_SIZE_TEST: 2048\n  CROP:\n    ENABLED: True\n    TYPE: \"absolute\"\n    SIZE: (512, 1024)\n    SINGLE_CATEGORY_MAX_AREA: 0.75\n  COLOR_AUG_SSD: True\nDATALOADER:\n  NUM_WORKERS: 10\n"
  }
]
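`FAN.forward` in `fan.py` above walks the bottom-up features from lowest to highest resolution: the coarsest map passes through a 1x1 lateral conv only, each finer map is fused with the running result by `FeatureAlign_V2`, and each fused output (after a 3x3 output conv) is prepended so the returned dict is ordered high to low resolution. A minimal trace of that traversal, with string labels standing in for tensors and stub names (`lateral`, `align`, `out`) for the modules, all hypothetical:

```python
def fan_forward_order(in_features):
    """Trace the top-down fusion order of FAN.forward; in_features is listed
    high -> low resolution, e.g. ["res2", "res3", "res4", "res5"]."""
    top_down = in_features[::-1]                   # start from the coarsest level
    prev = "lateral({})".format(top_down[0])       # 1x1 lateral conv only
    results = [prev]
    for feat in top_down[1:]:
        prev = "align({}, {})".format(feat, prev)  # FeatureAlign_V2 fusion
        results.insert(0, "out({})".format(prev))  # 3x3 output conv, prepend
    return results                                 # high -> low resolution
```

The prepend makes the final list line up positionally with `["p2", "p3", "p4", "p5"]` before any top-block levels are appended.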