Full Code of tangtaogo/lidar-nerf for AI

Repository: tangtaogo/lidar-nerf
Branch: main
Commit: 8083a1d74eef
Files: 90
Total size: 579.5 KB

Directory structure:
gitextract_gza2uyyg/

├── .github/
│   └── workflows/
│       └── formatter.yml
├── LICENSE
├── configs/
│   ├── kitti360_1538.txt
│   ├── kitti360_1728.txt
│   ├── kitti360_1908.txt
│   ├── kitti360_3353.txt
│   └── nerf_mvl.txt
├── extern/
│   ├── chamfer3D/
│   │   ├── chamfer3D.cu
│   │   ├── chamfer_cuda.cpp
│   │   ├── dist_chamfer_3D.py
│   │   └── setup.py
│   └── fscore.py
├── lidarmvl/
│   └── readme.md
├── lidarnerf/
│   ├── __init__.py
│   ├── activation.py
│   ├── convert.py
│   ├── dataset/
│   │   ├── base_dataset.py
│   │   ├── kitti360_dataset.py
│   │   └── nerfmvl_dataset.py
│   ├── encoding.py
│   ├── ffmlp/
│   │   ├── __init__.py
│   │   ├── backend.py
│   │   ├── ffmlp.py
│   │   ├── setup.py
│   │   └── src/
│   │       ├── bindings.cpp
│   │       ├── cutlass_matmul.h
│   │       ├── ffmlp.cu
│   │       ├── ffmlp.h
│   │       └── utils.h
│   ├── freqencoder/
│   │   ├── __init__.py
│   │   ├── backend.py
│   │   ├── freq.py
│   │   ├── setup.py
│   │   └── src/
│   │       ├── bindings.cpp
│   │       ├── freqencoder.cu
│   │       └── freqencoder.h
│   ├── gridencoder/
│   │   ├── __init__.py
│   │   ├── backend.py
│   │   ├── grid.py
│   │   ├── setup.py
│   │   └── src/
│   │       ├── bindings.cpp
│   │       ├── gridencoder.cu
│   │       └── gridencoder.h
│   ├── loss.py
│   ├── nerf/
│   │   ├── network.py
│   │   ├── network_tcnn.py
│   │   ├── renderer.py
│   │   └── utils.py
│   ├── raymarching/
│   │   ├── __init__.py
│   │   ├── backend.py
│   │   ├── raymarching.py
│   │   ├── setup.py
│   │   └── src/
│   │       ├── bindings.cpp
│   │       ├── raymarching.cu
│   │       └── raymarching.h
│   └── shencoder/
│       ├── __init__.py
│       ├── backend.py
│       ├── setup.py
│       ├── sphere_harmonics.py
│       └── src/
│           ├── bindings.cpp
│           ├── shencoder.cu
│           └── shencoder.h
├── lidarnvs/
│   ├── __init__.py
│   ├── configs/
│   │   ├── pcgen_kitti360_raydrop.txt
│   │   └── pcgen_nerfmvl_raydrop.txt
│   ├── eval.py
│   ├── lidarnvs_base.py
│   ├── lidarnvs_meshing.py
│   ├── lidarnvs_nksr.py
│   ├── lidarnvs_pcgen.py
│   ├── lidarnvs_poisson.py
│   ├── loader.py
│   ├── plot_possion_grid_search.py
│   ├── raydrop_dataset_poisson.py
│   ├── raydrop_train_pcgen.py
│   ├── raydrop_train_poisson.py
│   ├── readme.md
│   ├── run.py
│   └── unet.py
├── main_lidarnerf.py
├── preprocess/
│   ├── cal_centerpose_bound.py
│   ├── generate_train_rangeview.py
│   ├── kitti360_loader.py
│   ├── kitti360_to_nerf.py
│   ├── nerfmvl_loader.py
│   └── nerfmvl_to_nerf.py
├── readme.md
├── requirements.txt
├── requirements_torch.txt
└── setup.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .github/workflows/formatter.yml
================================================
name: Formatter

on:
  push:
    branches:
      - main
  pull_request:
    types: [opened, reopened, synchronize]

jobs:
  formatter:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: psf/black@stable


================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2022 Tao Tang, Yixing Lao

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================
FILE: configs/kitti360_1538.txt
================================================
sequence_id = 1538
alpha_d = 1000.0
alpha_r = 1
alpha_i = 1e1
alpha_grad = 100.0
grad_loss = True
desired_resolution = 32768
change_patch_size_lidar = [2, 8]
num_steps = 768
upsample_steps = 64
bound = 1
scale = 0.01150158050828236
offset = [1150.2651429096413, 3997.2217130085182, 109.3943550832148]


================================================
FILE: configs/kitti360_1728.txt
================================================
sequence_id = 1728
alpha_d = 1000.0
alpha_r = 1
alpha_i = 1e1
alpha_grad = 100.0
grad_loss = True
desired_resolution = 32768
change_patch_size_lidar = [2, 8]
num_steps = 768
upsample_steps = 64
bound = 1
scale = 0.01235117157331213
offset = [1036.6078389848537, 3863.5989919125104, 111.73904860790459]

================================================
FILE: configs/kitti360_1908.txt
================================================
sequence_id = 1908
alpha_d = 1000.0
alpha_r = 1
alpha_i = 1e1
alpha_grad = 100.0
grad_loss = True
desired_resolution = 32768
change_patch_size_lidar = [2, 8]
num_steps = 768
upsample_steps = 64
bound = 1
scale = 0.010784853507573345
offset = [1069.988979297527, 3765.8807850056446, 113.0212841477088]


================================================
FILE: configs/kitti360_3353.txt
================================================
sequence_id = 3353
alpha_d = 1000.0
alpha_r = 1
alpha_i = 1e1
alpha_grad = 100.0
grad_loss = True
desired_resolution = 32768
change_patch_size_lidar = [2, 8]
num_steps = 768
upsample_steps = 64
bound = 1
scale = 0.00951045294058913
offset = [1364.3592435499154, 3818.620913210761, 108.69906656243805]

================================================
FILE: configs/nerf_mvl.txt
================================================
path = data/nerf_mvl
dataloader = nerf_mvl
sequence_id = car
alpha_d = 1000.0
alpha_r = 1
alpha_i = 1
alpha_grad = 100.0
intensity_inv_scale=255.0
grad_loss = False
desired_resolution = 32768
eval_interval=5
num_steps = 768
upsample_steps = 64
bound = 1
scale = 0.005
offset = [973.0483450856506, 648.3910430331337, -8.442160936778045]


================================================
FILE: extern/chamfer3D/chamfer3D.cu
================================================

#include <ATen/ATen.h>
#include <cuda.h>
#include <cuda_runtime.h>
#include <stdio.h>

#include <vector>

__global__ void NmDistanceKernel(int b,
                                 int n,
                                 const float *xyz,
                                 int m,
                                 const float *xyz2,
                                 float *result,
                                 int *result_i) {
    const int batch = 512;
    __shared__ float buf[batch * 3];
    for (int i = blockIdx.x; i < b; i += gridDim.x) {
        for (int k2 = 0; k2 < m; k2 += batch) {
            int end_k = min(m, k2 + batch) - k2;
            for (int j = threadIdx.x; j < end_k * 3; j += blockDim.x) {
                buf[j] = xyz2[(i * m + k2) * 3 + j];
            }
            __syncthreads();
            for (int j = threadIdx.x + blockIdx.y * blockDim.x; j < n;
                 j += blockDim.x * gridDim.y) {
                float x1 = xyz[(i * n + j) * 3 + 0];
                float y1 = xyz[(i * n + j) * 3 + 1];
                float z1 = xyz[(i * n + j) * 3 + 2];
                int best_i = 0;
                float best = 0;
                int end_ka = end_k - (end_k & 3);
                if (end_ka == batch) {
                    for (int k = 0; k < batch; k += 4) {
                        {
                            float x2 = buf[k * 3 + 0] - x1;
                            float y2 = buf[k * 3 + 1] - y1;
                            float z2 = buf[k * 3 + 2] - z1;
                            float d = x2 * x2 + y2 * y2 + z2 * z2;
                            if (k == 0 || d < best) {
                                best = d;
                                best_i = k + k2;
                            }
                        }
                        {
                            float x2 = buf[k * 3 + 3] - x1;
                            float y2 = buf[k * 3 + 4] - y1;
                            float z2 = buf[k * 3 + 5] - z1;
                            float d = x2 * x2 + y2 * y2 + z2 * z2;
                            if (d < best) {
                                best = d;
                                best_i = k + k2 + 1;
                            }
                        }
                        {
                            float x2 = buf[k * 3 + 6] - x1;
                            float y2 = buf[k * 3 + 7] - y1;
                            float z2 = buf[k * 3 + 8] - z1;
                            float d = x2 * x2 + y2 * y2 + z2 * z2;
                            if (d < best) {
                                best = d;
                                best_i = k + k2 + 2;
                            }
                        }
                        {
                            float x2 = buf[k * 3 + 9] - x1;
                            float y2 = buf[k * 3 + 10] - y1;
                            float z2 = buf[k * 3 + 11] - z1;
                            float d = x2 * x2 + y2 * y2 + z2 * z2;
                            if (d < best) {
                                best = d;
                                best_i = k + k2 + 3;
                            }
                        }
                    }
                } else {
                    for (int k = 0; k < end_ka; k += 4) {
                        {
                            float x2 = buf[k * 3 + 0] - x1;
                            float y2 = buf[k * 3 + 1] - y1;
                            float z2 = buf[k * 3 + 2] - z1;
                            float d = x2 * x2 + y2 * y2 + z2 * z2;
                            if (k == 0 || d < best) {
                                best = d;
                                best_i = k + k2;
                            }
                        }
                        {
                            float x2 = buf[k * 3 + 3] - x1;
                            float y2 = buf[k * 3 + 4] - y1;
                            float z2 = buf[k * 3 + 5] - z1;
                            float d = x2 * x2 + y2 * y2 + z2 * z2;
                            if (d < best) {
                                best = d;
                                best_i = k + k2 + 1;
                            }
                        }
                        {
                            float x2 = buf[k * 3 + 6] - x1;
                            float y2 = buf[k * 3 + 7] - y1;
                            float z2 = buf[k * 3 + 8] - z1;
                            float d = x2 * x2 + y2 * y2 + z2 * z2;
                            if (d < best) {
                                best = d;
                                best_i = k + k2 + 2;
                            }
                        }
                        {
                            float x2 = buf[k * 3 + 9] - x1;
                            float y2 = buf[k * 3 + 10] - y1;
                            float z2 = buf[k * 3 + 11] - z1;
                            float d = x2 * x2 + y2 * y2 + z2 * z2;
                            if (d < best) {
                                best = d;
                                best_i = k + k2 + 3;
                            }
                        }
                    }
                }
                for (int k = end_ka; k < end_k; k++) {
                    float x2 = buf[k * 3 + 0] - x1;
                    float y2 = buf[k * 3 + 1] - y1;
                    float z2 = buf[k * 3 + 2] - z1;
                    float d = x2 * x2 + y2 * y2 + z2 * z2;
                    if (k == 0 || d < best) {
                        best = d;
                        best_i = k + k2;
                    }
                }
                if (k2 == 0 || result[(i * n + j)] > best) {
                    result[(i * n + j)] = best;
                    result_i[(i * n + j)] = best_i;
                }
            }
            __syncthreads();
        }
    }
}
// int chamfer_cuda_forward(int b,int n,const float * xyz,int m,const float *
// xyz2,float * result,int * result_i,float * result2,int * result2_i,
// cudaStream_t stream){
int chamfer_cuda_forward(at::Tensor xyz1,
                         at::Tensor xyz2,
                         at::Tensor dist1,
                         at::Tensor dist2,
                         at::Tensor idx1,
                         at::Tensor idx2) {
    const auto batch_size = xyz1.size(0);
    const auto n = xyz1.size(1);  // num_points point cloud A
    const auto m = xyz2.size(1);  // num_points point cloud B

    // Tensor::data<T>() is deprecated; data_ptr<T>() is the supported accessor.
    NmDistanceKernel<<<dim3(32, 16, 1), 512>>>(
            batch_size, n, xyz1.data_ptr<float>(), m, xyz2.data_ptr<float>(),
            dist1.data_ptr<float>(), idx1.data_ptr<int>());
    NmDistanceKernel<<<dim3(32, 16, 1), 512>>>(
            batch_size, m, xyz2.data_ptr<float>(), n, xyz1.data_ptr<float>(),
            dist2.data_ptr<float>(), idx2.data_ptr<int>());

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("error in nnd updateOutput: %s\n", cudaGetErrorString(err));
        // THError("aborting");
        return 0;
    }
    return 1;
}
__global__ void NmDistanceGradKernel(int b,
                                     int n,
                                     const float *xyz1,
                                     int m,
                                     const float *xyz2,
                                     const float *grad_dist1,
                                     const int *idx1,
                                     float *grad_xyz1,
                                     float *grad_xyz2) {
    for (int i = blockIdx.x; i < b; i += gridDim.x) {
        for (int j = threadIdx.x + blockIdx.y * blockDim.x; j < n;
             j += blockDim.x * gridDim.y) {
            float x1 = xyz1[(i * n + j) * 3 + 0];
            float y1 = xyz1[(i * n + j) * 3 + 1];
            float z1 = xyz1[(i * n + j) * 3 + 2];
            int j2 = idx1[i * n + j];
            float x2 = xyz2[(i * m + j2) * 3 + 0];
            float y2 = xyz2[(i * m + j2) * 3 + 1];
            float z2 = xyz2[(i * m + j2) * 3 + 2];
            float g = grad_dist1[i * n + j] * 2;
            atomicAdd(&(grad_xyz1[(i * n + j) * 3 + 0]), g * (x1 - x2));
            atomicAdd(&(grad_xyz1[(i * n + j) * 3 + 1]), g * (y1 - y2));
            atomicAdd(&(grad_xyz1[(i * n + j) * 3 + 2]), g * (z1 - z2));
            atomicAdd(&(grad_xyz2[(i * m + j2) * 3 + 0]), -(g * (x1 - x2)));
            atomicAdd(&(grad_xyz2[(i * m + j2) * 3 + 1]), -(g * (y1 - y2)));
            atomicAdd(&(grad_xyz2[(i * m + j2) * 3 + 2]), -(g * (z1 - z2)));
        }
    }
}
// int chamfer_cuda_backward(int b,int n,const float * xyz1,int m,const float *
// xyz2,const float * grad_dist1,const int * idx1,const float * grad_dist2,const
// int * idx2,float * grad_xyz1,float * grad_xyz2, cudaStream_t stream){
int chamfer_cuda_backward(at::Tensor xyz1,
                          at::Tensor xyz2,
                          at::Tensor gradxyz1,
                          at::Tensor gradxyz2,
                          at::Tensor graddist1,
                          at::Tensor graddist2,
                          at::Tensor idx1,
                          at::Tensor idx2) {
    // cudaMemset(grad_xyz1,0,b*n*3*4);
    // cudaMemset(grad_xyz2,0,b*m*3*4);

    const auto batch_size = xyz1.size(0);
    const auto n = xyz1.size(1);  // num_points point cloud A
    const auto m = xyz2.size(1);  // num_points point cloud B

    // Tensor::data<T>() is deprecated; data_ptr<T>() is the supported accessor.
    NmDistanceGradKernel<<<dim3(1, 16, 1), 256>>>(
            batch_size, n, xyz1.data_ptr<float>(), m, xyz2.data_ptr<float>(),
            graddist1.data_ptr<float>(), idx1.data_ptr<int>(),
            gradxyz1.data_ptr<float>(), gradxyz2.data_ptr<float>());
    NmDistanceGradKernel<<<dim3(1, 16, 1), 256>>>(
            batch_size, m, xyz2.data_ptr<float>(), n, xyz1.data_ptr<float>(),
            graddist2.data_ptr<float>(), idx2.data_ptr<int>(),
            gradxyz2.data_ptr<float>(), gradxyz1.data_ptr<float>());

    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        printf("error in nnd get grad: %s\n", cudaGetErrorString(err));
        // THError("aborting");
        return 0;
    }
    return 1;
}


================================================
FILE: extern/chamfer3D/chamfer_cuda.cpp
================================================
#include <torch/torch.h>

#include <vector>

/// TMP
// #include "common.h"
/// NOT TMP

int chamfer_cuda_forward(at::Tensor xyz1,
                         at::Tensor xyz2,
                         at::Tensor dist1,
                         at::Tensor dist2,
                         at::Tensor idx1,
                         at::Tensor idx2);

int chamfer_cuda_backward(at::Tensor xyz1,
                          at::Tensor xyz2,
                          at::Tensor gradxyz1,
                          at::Tensor gradxyz2,
                          at::Tensor graddist1,
                          at::Tensor graddist2,
                          at::Tensor idx1,
                          at::Tensor idx2);

int chamfer_forward(at::Tensor xyz1,
                    at::Tensor xyz2,
                    at::Tensor dist1,
                    at::Tensor dist2,
                    at::Tensor idx1,
                    at::Tensor idx2) {
    return chamfer_cuda_forward(xyz1, xyz2, dist1, dist2, idx1, idx2);
}

int chamfer_backward(at::Tensor xyz1,
                     at::Tensor xyz2,
                     at::Tensor gradxyz1,
                     at::Tensor gradxyz2,
                     at::Tensor graddist1,
                     at::Tensor graddist2,
                     at::Tensor idx1,
                     at::Tensor idx2) {
    return chamfer_cuda_backward(xyz1, xyz2, gradxyz1, gradxyz2, graddist1,
                                 graddist2, idx1, idx2);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("forward", &chamfer_forward, "chamfer forward (CUDA)");
    m.def("backward", &chamfer_backward, "chamfer backward (CUDA)");
}


================================================
FILE: extern/chamfer3D/dist_chamfer_3D.py
================================================
from torch import nn
from torch.autograd import Function
import torch
import importlib.util
import os
import sys
from pathlib import Path

script_dir = Path(__file__).parent.absolute()
object_dir = script_dir.parent / "tmp"
sys.path.append(str(object_dir))

# importlib.find_loader was removed in Python 3.12; find_spec replaces it.
chamfer_found = importlib.util.find_spec("chamfer_3D") is not None

if not chamfer_found:
    ## Cool trick from https://github.com/chrdiller
    cur_path = os.path.dirname(os.path.abspath(__file__))
    build_path = cur_path.replace("chamfer3D", "tmp")
    os.makedirs(build_path, exist_ok=True)
    print(f"Jitting Chamfer 3D to {build_path}")

    from torch.utils.cpp_extension import load

    chamfer_3D = load(
        name="chamfer_3D",
        sources=[
            "/".join(os.path.abspath(__file__).split("/")[:-1] + ["chamfer_cuda.cpp"]),
            "/".join(os.path.abspath(__file__).split("/")[:-1] + ["chamfer3D.cu"]),
        ],
        build_directory=build_path,
    )
    print(f"Loaded jitted library {chamfer_3D.__file__}")
else:
    import chamfer_3D

    print(f"Loaded pre-compiled library {chamfer_3D.__file__}")


# Chamfer's distance module @thibaultgroueix
# GPU tensors only
class chamfer_3DFunction(Function):
    @staticmethod
    def forward(ctx, xyz1, xyz2):
        batchsize, n, dim = xyz1.size()
        assert (
            dim == 3
        ), "Wrong last dimension for the chamfer distance 's input! Check with .size()"
        _, m, dim = xyz2.size()
        assert (
            dim == 3
        ), "Wrong last dimension for the chamfer distance 's input! Check with .size()"
        device = xyz1.device

        dist1 = torch.zeros(batchsize, n)
        dist2 = torch.zeros(batchsize, m)

        idx1 = torch.zeros(batchsize, n).type(torch.IntTensor)
        idx2 = torch.zeros(batchsize, m).type(torch.IntTensor)

        dist1 = dist1.to(device)
        dist2 = dist2.to(device)
        idx1 = idx1.to(device)
        idx2 = idx2.to(device)
        torch.cuda.set_device(device)

        chamfer_3D.forward(xyz1, xyz2, dist1, dist2, idx1, idx2)
        ctx.save_for_backward(xyz1, xyz2, idx1, idx2)
        return dist1, dist2, idx1, idx2

    @staticmethod
    def backward(ctx, graddist1, graddist2, gradidx1, gradidx2):
        xyz1, xyz2, idx1, idx2 = ctx.saved_tensors
        graddist1 = graddist1.contiguous()
        graddist2 = graddist2.contiguous()
        device = graddist1.device

        gradxyz1 = torch.zeros(xyz1.size())
        gradxyz2 = torch.zeros(xyz2.size())

        gradxyz1 = gradxyz1.to(device)
        gradxyz2 = gradxyz2.to(device)
        chamfer_3D.backward(
            xyz1, xyz2, gradxyz1, gradxyz2, graddist1, graddist2, idx1, idx2
        )
        return gradxyz1, gradxyz2


class chamfer_3DDist(nn.Module):
    def __init__(self):
        super(chamfer_3DDist, self).__init__()

    def forward(self, input1, input2):
        input1 = input1.contiguous()
        input2 = input2.contiguous()
        return chamfer_3DFunction.apply(input1, input2)


================================================
FILE: extern/chamfer3D/setup.py
================================================
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

setup(
    name="chamfer_3D",
    ext_modules=[
        CUDAExtension(
            "chamfer_3D",
            [
                "/".join(__file__.split("/")[:-1] + ["chamfer_cuda.cpp"]),
                "/".join(__file__.split("/")[:-1] + ["chamfer3D.cu"]),
            ],
        ),
    ],
    cmdclass={"build_ext": BuildExtension},
)


================================================
FILE: extern/fscore.py
================================================
import torch


def fscore(dist1, dist2, threshold=0.001):
    """
    Calculates the F-score between two point clouds with the corresponding threshold value.
    :param dist1: Batch, N-Points
    :param dist2: Batch, M-Points
    :param threshold: float
    :return: fscore, precision, recall
    """
    # NB: In this repo, dist1 and dist2 are squared point cloud Euclidean
    # distances, so you should adapt the threshold accordingly.
    precision_1 = torch.mean((dist1 < threshold).float(), dim=1)
    precision_2 = torch.mean((dist2 < threshold).float(), dim=1)
    fscore = 2 * precision_1 * precision_2 / (precision_1 + precision_2)
    fscore[torch.isnan(fscore)] = 0
    return fscore, precision_1, precision_2


================================================
FILE: lidarmvl/readme.md
================================================
# LiDAR-MVL

![dataset_vis.png](../assets/dataset_vis.png)

| Sensor                        | Details (Sensor location: F: front. T: top.)                 |
| ----------------------------- | ------------------------------------------------------------ |
| LiDAR-F                       | Spinning, 64 beams, 10Hz capture frequency, 360° horizontal FOV, 0.6° horizontal resolution, -52.1° to +52.1° vertical FOV, ≤60m range, ±3cm accuracy. |
| LiDAR-T                       | Spinning, 64 beams, 20Hz capture frequency, 360° horizontal FOV, 0.4° horizontal resolution, -25° to +15° vertical FOV, ≤200m range, ±2cm accuracy. |
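
As a rough illustration of what these specifications imply for a range-image (panorama) representation, the sketch below derives an image size and an average vertical angular step per sensor. It assumes one image row per beam, one column per horizontal resolution step, and uniformly spaced beams; the names `SENSORS`, `range_image_shape`, and `vertical_resolution_deg` are hypothetical and not part of the dataset tooling.

```python
# Sensor specifications copied from the table above.
SENSORS = {
    "LiDAR-F": {"beams": 64, "h_res_deg": 0.6, "v_fov_deg": (-52.1, 52.1)},
    "LiDAR-T": {"beams": 64, "h_res_deg": 0.4, "v_fov_deg": (-25.0, 15.0)},
}


def range_image_shape(spec, h_fov_deg=360.0):
    """Return (H, W): one row per beam, one column per azimuth step."""
    width = round(h_fov_deg / spec["h_res_deg"])
    return spec["beams"], width


def vertical_resolution_deg(spec):
    """Average vertical angle per row, assuming uniformly spaced beams."""
    v_min, v_max = spec["v_fov_deg"]
    return (v_max - v_min) / spec["beams"]


for name, spec in SENSORS.items():
    print(name, range_image_shape(spec), vertical_resolution_deg(spec))
```

Under these assumptions a full sweep maps to 600 columns for LiDAR-F and 900 columns for LiDAR-T, with 64 rows each.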

We establish an object-centric **m**ulti-**v**iew **L**iDAR dataset, which we
dub the **NeRF-MVL** dataset, containing carefully calibrated sensor poses,
acquired from multi-LiDAR sensor data from real autonomous vehicles. It contains
more than **76k frames** covering two types of collecting vehicles, three LiDAR
settings, two collecting paths, and nine object categories.



## Dataset Format

```bash
nerf_mvl
├── nerf_mvl_76k
│   ├── vehicle_type1
│   │   ├── LiDAR
│   │   │   └── {class_name}
│   │   │       ├── l
│   │   │       └── s
│   │   │           ├── {frame_id}.npy
│   │   │           └── lidar2world.txt
│   │   ├── LiDAR_F
│   │   └── LiDAR_T
│   └── vehicle_type2
│       ├── LiDAR
│       ├── LiDAR_F
│       └── LiDAR_T
│
└── nerf_mvl_7k
    └── {class_name}
        ├── {frame_id}.npy
        └── lidar2world.txt

Note:
{class_name}: {bollard, pedestrian, plant, traffic_cone, water_safety_barrier, car, pier, tire, warning_sign}
{frame_id}.npy: local point clouds, (N,4)
lidar2world.txt: the lidar to world matrix, (M, 16)
l/s: large/small collecting paths
```

For fast validation, we extract a pocket version of the dataset with only 7.3k
frames covering the nine categories, called **nerf_mvl_7k**.

For all point cloud frames, we crop out the region of interest, i.e., the
object. The raw data will also be released to the community soon.
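
The layout described above can be read with a few lines of NumPy. The following is a minimal sketch, not the repository's loader: the helper names are hypothetical, and it assumes that the fourth point column is intensity, that each row of `lidar2world.txt` is a row-major flattened 4x4 matrix, and that the caller knows which pose row belongs to which frame.

```python
from pathlib import Path

import numpy as np


def load_frame(class_dir, frame_id):
    """Load one cropped frame as an (N, 4) array."""
    points = np.load(Path(class_dir) / f"{frame_id}.npy")
    assert points.ndim == 2 and points.shape[1] == 4
    return points


def load_lidar2world(class_dir):
    """Load lidar2world.txt, shape (M, 16), and reshape it to (M, 4, 4)."""
    flat = np.loadtxt(Path(class_dir) / "lidar2world.txt", ndmin=2)
    # Assumption: each row stores a 4x4 matrix in row-major order.
    return flat.reshape(-1, 4, 4)


def to_world(points, lidar2world):
    """Apply one 4x4 lidar-to-world transform to the xyz columns."""
    xyz = points[:, :3]
    rotation = lidar2world[:3, :3]
    translation = lidar2world[:3, 3]
    return xyz @ rotation.T + translation
```

A typical call would be `to_world(load_frame(class_dir, frame_id), load_lidar2world(class_dir)[k])`, where `k` is the pose row for that frame; how frame ids map to pose rows is not stated here and should be checked against the data.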



## Citation
If you find our dataset helpful, please consider citing:

```
@article{tao2023lidar,
  title={LiDAR-NeRF: Novel LiDAR View Synthesis via Neural Radiance Fields},
  author={Tao, Tang and Gao, Longfei and Wang, Guangrun and Lao, Yixing and Chen, Peng and Zhao, Hengshuang and Hao, Dayang and Liang, Xiaodan and Salzmann, Mathieu and Yu, Kaicheng},
  journal={arXiv preprint arXiv:2304.10406},
  year={2023}
}
```


================================================
FILE: lidarnerf/__init__.py
================================================
__version__ = "0.1.0"


================================================
FILE: lidarnerf/activation.py
================================================
import torch
from torch.autograd import Function
from torch.cuda.amp import custom_bwd, custom_fwd


class _trunc_exp(Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # cast to float32
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.exp(x)

    @staticmethod
    @custom_bwd
    def backward(ctx, g):
        x = ctx.saved_tensors[0]
        return g * torch.exp(x.clamp(-15, 15))


trunc_exp = _trunc_exp.apply


================================================
FILE: lidarnerf/convert.py
================================================
import numpy as np


def lidar_to_pano_with_intensities_with_bbox_mask(
    local_points_with_intensities: np.ndarray,
    lidar_H: int,
    lidar_W: int,
    lidar_K: int,
    bbox_local: np.ndarray,
    max_depth=80,
    max_intensity=255.0,
):
    """
    Convert lidar frame to pano frame with intensities with bbox_mask.
    Lidar points are in local coordinates.

    Args:
        local_points_with_intensities: (N, 4), float32, in lidar frame.
        lidar_H: pano height.
        lidar_W: pano width.
        lidar_K: lidar intrinsics (fov_up, fov), in degrees.
        bbox_local: (8x4), world bbox in local.
        max_depth: max depth in meters.
        max_intensity: max intensity.

    Return:
        pano: (H, W), float32.
        intensities: (H, W), float32.
    """

    # Unpack.
    local_points = local_points_with_intensities[:, :3]
    local_point_intensities = local_points_with_intensities[:, 3]
    fov_up, fov = lidar_K
    fov_down = fov - fov_up

    # Compute dists to lidar center.
    dists = np.linalg.norm(local_points, axis=1)

    # Fill pano and intensities.
    pano = np.zeros((lidar_H, lidar_W))
    intensities = np.zeros((lidar_H, lidar_W))

    # bbox mask
    pano[:, :] = -1
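    # Integer sentinels keep the slice below valid (and empty) when no bbox
    # corner projects inside the pano; float sentinels would raise TypeError.
    r_min, r_max, c_min, c_max = lidar_H, -1, lidar_W, -1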
    for bbox_local_point in bbox_local:
        x, y, z, _ = bbox_local_point
        beta = np.pi - np.arctan2(y, x)
        alpha = np.arctan2(z, np.sqrt(x**2 + y**2)) + fov_down / 180 * np.pi

        c = int(round(beta / (2 * np.pi / lidar_W)))
        r = int(round(lidar_H - alpha / (fov / 180 * np.pi / lidar_H)))

        # Check out-of-bounds.
        if r >= lidar_H or r < 0 or c >= lidar_W or c < 0:
            continue
        else:
            r_min, r_max, c_min, c_max = (
                min(r_min, r),
                max(r_max, r),
                min(c_min, c),
                max(c_max, c),
            )

    pano[r_min:r_max, c_min:c_max] = 0

    # Fill pano and intensities.
    for local_point, dist, local_point_intensity in zip(
        local_points,
        dists,
        local_point_intensities,
    ):
        # Check max depth.
        if dist >= max_depth:
            continue

        x, y, z = local_point
        beta = np.pi - np.arctan2(y, x)
        alpha = np.arctan2(z, np.sqrt(x**2 + y**2)) + fov_down / 180 * np.pi
        c = int(round(beta / (2 * np.pi / lidar_W)))
        r = int(round(lidar_H - alpha / (fov / 180 * np.pi / lidar_H)))

        # Check out-of-bounds.
        if r >= lidar_H or r < 0 or c >= lidar_W or c < 0:
            continue

        # Set to min dist if not set.
        if pano[r, c] == 0.0:
            pano[r, c] = dist
            intensities[r, c] = local_point_intensity / max_intensity
        elif pano[r, c] > dist:
            pano[r, c] = dist
            intensities[r, c] = local_point_intensity / max_intensity

    return pano, intensities


def lidar_to_pano_with_intensities(
    local_points_with_intensities: np.ndarray,
    lidar_H: int,
    lidar_W: int,
    lidar_K: int,
    max_depth=80,
):
    """
    Convert lidar frame to pano frame with intensities.
    Lidar points are in local coordinates.

    Args:
        local_points_with_intensities: (N, 4), float32, in lidar frame.
        lidar_H: pano height.
        lidar_W: pano width.
        lidar_K: lidar intrinsics (fov_up, fov), in degrees.
        max_depth: max depth in meters.

    Return:
        pano: (H, W), float32.
        intensities: (H, W), float32.
    """
    # Unpack.
    local_points = local_points_with_intensities[:, :3]
    local_point_intensities = local_points_with_intensities[:, 3]
    fov_up, fov = lidar_K
    fov_down = fov - fov_up

    # Compute dists to lidar center.
    dists = np.linalg.norm(local_points, axis=1)

    # Fill pano and intensities.
    pano = np.zeros((lidar_H, lidar_W))
    intensities = np.zeros((lidar_H, lidar_W))
    for local_point, dist, local_point_intensity in zip(
        local_points,
        dists,
        local_point_intensities,
    ):
        # Check max depth.
        if dist >= max_depth:
            continue

        x, y, z = local_point
        beta = np.pi - np.arctan2(y, x)
        alpha = np.arctan2(z, np.sqrt(x**2 + y**2)) + fov_down / 180 * np.pi
        c = int(round(beta / (2 * np.pi / lidar_W)))
        r = int(round(lidar_H - alpha / (fov / 180 * np.pi / lidar_H)))

        # Check out-of-bounds.
        if r >= lidar_H or r < 0 or c >= lidar_W or c < 0:
            continue

        # Set to min dist if not set.
        if pano[r, c] == 0.0:
            pano[r, c] = dist
            intensities[r, c] = local_point_intensity
        elif pano[r, c] > dist:
            pano[r, c] = dist
            intensities[r, c] = local_point_intensity

    return pano, intensities


def lidar_to_pano(
    local_points: np.ndarray, lidar_H: int, lidar_W: int, lidar_K: int, max_depth=80
):
    """
    Convert lidar frame to pano frame. Lidar points are in local coordinates.

    Args:
        local_points: (N, 3), float32, in lidar frame.
        lidar_H: pano height.
        lidar_W: pano width.
        lidar_K: lidar intrinsics (fov_up, fov), in degrees.
        max_depth: max depth in meters.

    Return:
        pano: (H, W), float32.
    """

    # (N, 3) -> (N, 4), filled with zeros.
    local_points_with_intensities = np.concatenate(
        [local_points, np.zeros((local_points.shape[0], 1))], axis=1
    )
    # The callee's keyword is max_depth; the misspelled max_dpeth keyword
    # previously raised a TypeError on every call.
    pano, _ = lidar_to_pano_with_intensities(
        local_points_with_intensities=local_points_with_intensities,
        lidar_H=lidar_H,
        lidar_W=lidar_W,
        lidar_K=lidar_K,
        max_depth=max_depth,
    )
    return pano


def pano_to_lidar_with_intensities(pano: np.ndarray, intensities, lidar_K):
    """
    Args:
        pano: (H, W), float32.
        intensities: (H, W), float32.
        lidar_K: lidar intrinsics (fov_up, fov)

    Return:
        local_points_with_intensities: (N, 4), float32, in lidar frame.
    """
    fov_up, fov = lidar_K

    H, W = pano.shape
    i, j = np.meshgrid(
        np.arange(W, dtype=np.float32), np.arange(H, dtype=np.float32), indexing="xy"
    )
    beta = -(i - W / 2) / W * 2 * np.pi
    alpha = (fov_up - j / H * fov) / 180 * np.pi
    dirs = np.stack(
        [
            np.cos(alpha) * np.cos(beta),
            np.cos(alpha) * np.sin(beta),
            np.sin(alpha),
        ],
        -1,
    )
    local_points = dirs * pano.reshape(H, W, 1)

    # local_points: (H, W, 3)
    # intensities : (H, W)
    # local_points_with_intensities: (H, W, 4)
    local_points_with_intensities = np.concatenate(
        [local_points, intensities.reshape(H, W, 1)], axis=2
    )

    # Filter empty points.
    idx = np.where(pano != 0.0)
    local_points_with_intensities = local_points_with_intensities[idx]

    return local_points_with_intensities


def pano_to_lidar(pano, lidar_K):
    """
    Args:
        pano: (H, W), float32.
        lidar_K: lidar intrinsics (fov_up, fov)

    Return:
        local_points: (N, 3), float32, in lidar frame.
    """
    local_points_with_intensities = pano_to_lidar_with_intensities(
        pano=pano,
        intensities=np.zeros_like(pano),
        lidar_K=lidar_K,
    )
    return local_points_with_intensities[:, :3]


def lidar_to_pano_with_intensities_fpa(
    local_points_with_intensities: np.ndarray,
    lidar_H: int,
    lidar_W: int,
    lidar_K: tuple,
    max_depth=80,
    z_buffer_len=10,
):
    """
    Convert lidar frame to pano frame with intensities, resolving points that
    fall on the same pixel with a per-pixel z-buffer.
    Lidar points are in local coordinates.

    Args:
        local_points_with_intensities: (N, 4), float32, in lidar frame, with intensities.
        lidar_H: pano height.
        lidar_W: pano width.
        lidar_K: lidar intrinsics (fov_up, fov).
        max_depth: max depth in meters.
        z_buffer_len: length of the z_buffer.

    Return:
        pano: (H, W), float32.
        intensities: (H, W), float32.
    """

    # Unpack.
    local_points = local_points_with_intensities[:, :3]
    local_point_intensities = local_points_with_intensities[:, 3]
    fov_up, fov = lidar_K
    fov_down = fov - fov_up

    # Compute dists to lidar center.
    dists = np.linalg.norm(local_points, axis=1)

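    # Per-pixel z-buffer: channel 1 holds intensities, channel 2 holds depths.
    # Slot 0 of the depth channel stores the point count; slots 1.. hold values.
    range_view = np.zeros((lidar_H, lidar_W, 3, z_buffer_len + 1))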

    for local_point, dist, local_point_intensity in zip(
        local_points,
        dists,
        local_point_intensities,
    ):
        # Check max depth.
        if dist >= max_depth:
            continue

        x, y, z = local_point
        beta = np.pi - np.arctan2(y, x)
        alpha = np.arctan2(z, np.sqrt(x**2 + y**2)) + fov_down / 180 * np.pi
        c = int(round(beta / (2 * np.pi / lidar_W)))
        r = int(round(lidar_H - alpha / (fov / 180 * np.pi / lidar_H)))

        if r >= lidar_H or r < 0 or c >= lidar_W or c < 0:
            continue

        # Slot 0 of the depth channel counts the points buffered at this pixel.
        position = range_view[r, c, 2, 0] + 1
        if position > z_buffer_len:
            # Buffer is full: add the new point, then keep the z_buffer_len closest.
            depth_z_buffer = list(range_view[r, c, 2][1:]) + [dist]
            intensity_z_buffer = list(range_view[r, c, 1][1:]) + [local_point_intensity]
            position = position - 1

            sort_index = np.argsort(depth_z_buffer)
            depth_z_buffer = np.insert(
                np.array(depth_z_buffer)[sort_index][:z_buffer_len], 0, position
            )
            intensity_z_buffer = np.insert(
                np.array(intensity_z_buffer)[sort_index][:z_buffer_len], 0, position
            )
            range_view[r, c, 2] = depth_z_buffer
            range_view[r, c, 1] = intensity_z_buffer

        else:
            # Buffer has room: store the new point in the next free slot.
            range_view[r, c, 2, int(position)] = dist
            range_view[r, c, 1, int(position)] = local_point_intensity
        range_view[r, c, 2, 0] = position
    range_view = parse_z_buffer(range_view, lidar_H, lidar_W)
    return range_view[:, :, 2], range_view[:, :, 1]


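def parse_z_buffer(novel_pano, lidar_H, lidar_W, threshold=0.2):
    """
    Collapse a per-pixel z-buffer into a single range view.

    Args:
        novel_pano: (H, W, 3, z_buffer_len + 1), float. Channel 1 holds
            intensities and channel 2 holds depths; slot 0 of the depth channel
            stores the number of buffered points, slots 1.. hold the values.
        lidar_H: pano height.
        lidar_W: pano width.
        threshold: depth window in meters, measured from the closest point,
            inside which buffered points are averaged.

    Return:
        range_view: (H, W, 3), float. Channel 1 is intensity, channel 2 is depth.
    """
    range_view = np.zeros((lidar_H, lidar_W, 3))
    for i in range(lidar_H):
        for j in range(lidar_W):
            range_pixel = novel_pano[i, j, 2]
            intensity_pixel = novel_pano[i, j, 1]
            z_buffer_num = int(range_pixel[0])
            if z_buffer_num == 0:
                continue
            if z_buffer_num == 1:
                range_view[i, j, 2] = range_pixel[1]
                range_view[i, j, 1] = intensity_pixel[1]
                continue

            # Buffered values live in slots 1..z_buffer_num (inclusive), so the
            # slice must end at z_buffer_num + 1 to include the last point.
            depth_z_buffer = range_pixel[1 : z_buffer_num + 1]
            closest_depth = depth_z_buffer.min()
            index = depth_z_buffer <= (closest_depth + threshold)

            # Inverse-depth weighted average of the points near the closest one.
            final_depth_z_buffer = depth_z_buffer[index]
            weights = 1 / final_depth_z_buffer
            range_view[i, j, 2] = np.average(final_depth_z_buffer, weights=weights)

            intensity_z_buffer = intensity_pixel[1 : z_buffer_num + 1]
            range_view[i, j, 1] = np.average(intensity_z_buffer[index], weights=weights)
    return range_view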


================================================
FILE: lidarnerf/dataset/base_dataset.py
================================================
import numpy as np
import torch
import trimesh
from packaging import version as pver
from dataclasses import dataclass


def custom_meshgrid(*args):
    if pver.parse(torch.__version__) < pver.parse("1.10"):
        return torch.meshgrid(*args)
    else:
        return torch.meshgrid(*args, indexing="ij")


@torch.cuda.amp.autocast(enabled=False)
def get_lidar_rays(poses, intrinsics, H, W, N=-1, patch_size=1):
    """
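    Get lidar rays.

    Args:
        poses: [B, 4, 4], lidar2world
        intrinsics: [2], (fov_up, fov) in degrees
        H, W, N: int, pano height and width; N > 0 samples rays, otherwise all H * W rays are used
        patch_size: int or sequence of one or two ints, patch height and width
    Returns:
        dict with:
            rays_o, rays_d: [B, N, 3]
            inds: [B, N]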
    """
    device = poses.device
    B = poses.shape[0]

    i, j = custom_meshgrid(
        torch.linspace(0, W - 1, W, device=device),
        torch.linspace(0, H - 1, H, device=device),
    )  # float
    # i = i.t().reshape([1, H * W]).expand([B, H * W]) + 0.5
    # j = j.t().reshape([1, H * W]).expand([B, H * W]) + 0.5
    i = i.t().reshape([1, H * W]).expand([B, H * W])
    j = j.t().reshape([1, H * W]).expand([B, H * W])
    results = {}
    if N > 0:
        N = min(N, H * W)

        if isinstance(patch_size, int):
            patch_size_x, patch_size_y = patch_size, patch_size
        elif len(patch_size) == 1:
            patch_size_x, patch_size_y = patch_size[0], patch_size[0]
        else:
            patch_size_x, patch_size_y = patch_size

        if patch_size_x > 0:
            # randomly sample the top-left corner of each patch.
            # NOTE: this implementation samples pixels near the bottom and right
            # image borders less often.
            num_patch = N // (patch_size_x * patch_size_y)
            inds_x = torch.randint(0, H - patch_size_x, size=[num_patch], device=device)
            inds_y = torch.randint(0, W - patch_size_y, size=[num_patch], device=device)
            inds = torch.stack([inds_x, inds_y], dim=-1)  # [np, 2]

            # create meshgrid for each patch
            pi, pj = custom_meshgrid(
                torch.arange(patch_size_x, device=device),
                torch.arange(patch_size_y, device=device),
            )
            offsets = torch.stack([pi.reshape(-1), pj.reshape(-1)], dim=-1)  # [p^2, 2]

            inds = inds.unsqueeze(1) + offsets.unsqueeze(0)  # [np, p^2, 2]
            inds = inds.view(-1, 2)  # [N, 2]
            inds = inds[:, 0] * W + inds[:, 1]  # [N], flatten

            # N may not be a multiple of the patch area, so use the actual count.
            inds = inds.expand([B, inds.shape[0]])

        else:
            inds = torch.randint(0, H * W, size=[N], device=device)  # may duplicate
            inds = inds.expand([B, N])

        i = torch.gather(i, -1, inds)
        j = torch.gather(j, -1, inds)

        results["inds"] = inds

    else:
        inds = torch.arange(H * W, device=device).expand([B, H * W])
        results["inds"] = inds

    fov_up, fov = intrinsics
    beta = -(i - W / 2) / W * 2 * np.pi
    alpha = (fov_up - j / H * fov) / 180 * np.pi

    directions = torch.stack(
        [
            torch.cos(alpha) * torch.cos(beta),
            torch.cos(alpha) * torch.sin(beta),
            torch.sin(alpha),
        ],
        -1,
    )
    # directions = directions / torch.norm(directions, dim=-1, keepdim=True)
    rays_d = directions @ poses[:, :3, :3].transpose(-1, -2)  # (B, N, 3)
    rays_o = poses[..., :3, 3]  # [B, 3]
    rays_o = rays_o[..., None, :].expand_as(rays_d)  # [B, N, 3]

    results["rays_o"] = rays_o
    results["rays_d"] = rays_d

    return results


@torch.cuda.amp.autocast(enabled=False)
def get_rays(poses, intrinsics, H, W, N=-1, patch_size=1):
    """get rays
    Args:
        poses: [B, 4, 4], cam2world
        intrinsics: [4], (fx, fy, cx, cy)
        H, W, N: int
        patch_size: int
    Returns:
        dict with:
            rays_o, rays_d: [B, N, 3]
            inds: [B, N]
    """

    device = poses.device
    B = poses.shape[0]
    fx, fy, cx, cy = intrinsics

    i, j = custom_meshgrid(
        torch.linspace(0, W - 1, W, device=device),
        torch.linspace(0, H - 1, H, device=device),
    )  # float
    i = i.t().reshape([1, H * W]).expand([B, H * W]) + 0.5
    j = j.t().reshape([1, H * W]).expand([B, H * W]) + 0.5

    results = {}
    if N > 0:
        N = min(N, H * W)

        if patch_size > 1:
            # randomly sample the top-left corner of each patch.
            # NOTE: this implementation samples pixels near the bottom and right
            # image borders less often.
            num_patch = N // (patch_size**2)
            inds_x = torch.randint(0, H - patch_size, size=[num_patch], device=device)
            inds_y = torch.randint(0, W - patch_size, size=[num_patch], device=device)
            inds = torch.stack([inds_x, inds_y], dim=-1)  # [np, 2]

            # create meshgrid for each patch
            pi, pj = custom_meshgrid(
                torch.arange(patch_size, device=device),
                torch.arange(patch_size, device=device),
            )
            offsets = torch.stack([pi.reshape(-1), pj.reshape(-1)], dim=-1)  # [p^2, 2]

            inds = inds.unsqueeze(1) + offsets.unsqueeze(0)  # [np, p^2, 2]
            inds = inds.view(-1, 2)  # [N, 2]
            inds = inds[:, 0] * W + inds[:, 1]  # [N], flatten

            # N may not be a multiple of the patch area, so use the actual count.
            inds = inds.expand([B, inds.shape[0]])

        else:
            inds = torch.randint(0, H * W, size=[N], device=device)  # may duplicate
            inds = inds.expand([B, N])

        i = torch.gather(i, -1, inds)
        j = torch.gather(j, -1, inds)

        results["inds"] = inds

    else:
        inds = torch.arange(H * W, device=device).expand([B, H * W])
        results["inds"] = inds

    zs = torch.ones_like(i)
    xs = (i - cx) / fx * zs
    ys = (j - cy) / fy * zs
    directions = torch.stack((xs, ys, zs), dim=-1)
    directions = directions / torch.norm(directions, dim=-1, keepdim=True)
    rays_d = directions @ poses[:, :3, :3].transpose(-1, -2)  # (B, N, 3)

    rays_o = poses[..., :3, 3]  # [B, 3]
    rays_o = rays_o[..., None, :].expand_as(rays_d)  # [B, N, 3]

    results["rays_o"] = rays_o
    results["rays_d"] = rays_d

    return results


# ref: https://github.com/NVlabs/instant-ngp/blob/b76004c8cf478880227401ae763be4c02f80b62f/include/neural-graphics-primitives/nerf_loader.h#L50
def nerf_matrix_to_ngp(pose, scale=0.33, offset=(0, 0, 0)):
    # for the fox dataset, 0.33 scales camera radius to ~ 2
    new_pose = np.array(
        [
            [pose[1, 0], -pose[1, 1], -pose[1, 2], pose[1, 3] * scale + offset[0]],
            [pose[2, 0], -pose[2, 1], -pose[2, 2], pose[2, 3] * scale + offset[1]],
            [pose[0, 0], -pose[0, 1], -pose[0, 2], pose[0, 3] * scale + offset[2]],
            [0, 0, 0, 1],
        ],
        dtype=np.float32,
    )
    return new_pose


def visualize_poses(poses, size=0.1):
    # poses: [B, 4, 4]

    axes = trimesh.creation.axis(axis_length=4)
    box = trimesh.primitives.Box(extents=(2, 2, 2)).as_outline()
    box.colors = np.array([[128, 128, 128]] * len(box.entities))
    objects = [axes, box]

    for pose in poses:
        # a camera is visualized with 9 line segments:
        # 8 frustum edges plus 1 viewing-direction ray.
        pos = pose[:3, 3]
        a = pos + size * pose[:3, 0] + size * pose[:3, 1] + size * pose[:3, 2]
        b = pos - size * pose[:3, 0] + size * pose[:3, 1] + size * pose[:3, 2]
        c = pos - size * pose[:3, 0] - size * pose[:3, 1] + size * pose[:3, 2]
        d = pos + size * pose[:3, 0] - size * pose[:3, 1] + size * pose[:3, 2]

        dir = (a + b + c + d) / 4 - pos
        dir = dir / (np.linalg.norm(dir) + 1e-8)
        o = pos + dir * 3

        segs = np.array(
            [
                [pos, a],
                [pos, b],
                [pos, c],
                [pos, d],
                [a, b],
                [b, c],
                [c, d],
                [d, a],
                [pos, o],
            ]
        )
        segs = trimesh.load_path(segs)
        objects.append(segs)

    trimesh.Scene(objects).show()


@dataclass
class BaseDataset:
    pass


================================================
FILE: lidarnerf/dataset/kitti360_dataset.py
================================================
import json
import os

import numpy as np
import torch
import tqdm
from torch.utils.data import DataLoader
from dataclasses import dataclass, field

from lidarnerf.dataset.base_dataset import get_lidar_rays, BaseDataset


@dataclass
class KITTI360Dataset(BaseDataset):
    device: str = "cpu"
    split: str = "train"  # train, val, test
    root_path: str = "data/kitti360"
    sequence_id: str = "1908"
    preload: bool = True  # preload data into GPU
    # scale applied to poses and depths so the sensor stays inside the bounding box.
    scale: float = 1.0
    # translation offset subtracted from poses; must have 3 elements.
    offset: list = field(default_factory=lambda: [0.0, 0.0, 0.0])
    # bound = opt.bound  # bounding box half length, also used as the radius to random sample poses.
    fp16: bool = True  # if preload, load into fp16.
    patch_size: int = 1  # size of the image to extract from the scene.
    patch_size_lidar: int = 1  # size of the image to extract from the Lidar.
    enable_lidar: bool = True
    num_rays: int = 4096
    num_rays_lidar: int = 4096

    def __post_init__(self):
        if self.sequence_id == "1538":
            print("Using sequence 1538-1601")
        elif self.sequence_id == "1728":
            print("Using sequence 1728-1791")
        elif self.sequence_id == "1908":
            print("Using sequence 1908-1971")
        elif self.sequence_id == "3353":
            print("Using sequence 3353-3416")
        else:
            raise ValueError(f"Invalid sequence id: {self.sequence_id}")

        self.training = self.split in ["train", "all", "trainval"]
        self.num_rays = self.num_rays if self.training else -1
        self.num_rays_lidar = self.num_rays_lidar if self.training else -1
        # load nerf-compatible format data.
        with open(
            os.path.join(
                self.root_path, f"transforms_{self.sequence_id}_{self.split}.json"
            ),
            "r",
        ) as f:
            transform = json.load(f)

        # load image size
        if "h" in transform and "w" in transform:
            self.H = int(transform["h"])
            self.W = int(transform["w"])
        else:
            # we have to actually read an image to get H and W later.
            self.H = self.W = None

        if "h_lidar" in transform and "w_lidar" in transform:
            self.H_lidar = int(transform["h_lidar"])
            self.W_lidar = int(transform["w_lidar"])

        # read images
        frames = transform["frames"]
        # frames = sorted(frames, key=lambda d: d['file_path']) # why do I sort...

        self.poses_lidar = []
        self.images_lidar = []
        for f in tqdm.tqdm(frames, desc=f"Loading {self.split} data"):
            pose_lidar = np.array(f["lidar2world"], dtype=np.float32)  # [4, 4]

            f_lidar_path = os.path.join(self.root_path, f["lidar_file_path"])

            # pc[..., 0]: unused, pc[..., 1]: intensity, pc[..., 2]: depth
            pc = np.load(f_lidar_path)
            ray_drop = np.where(pc.reshape(-1, 3)[:, 2] == 0.0, 0.0, 1.0).reshape(
                self.H_lidar, self.W_lidar, 1
            )

            image_lidar = np.concatenate(
                [ray_drop, pc[:, :, 1, None], pc[:, :, 2, None] * self.scale],
                axis=-1,
            )

            self.poses_lidar.append(pose_lidar)
            self.images_lidar.append(image_lidar)

        self.poses_lidar = np.stack(self.poses_lidar, axis=0)
        self.poses_lidar[:, :3, -1] = (
            self.poses_lidar[:, :3, -1] - self.offset
        ) * self.scale
        self.poses_lidar = torch.from_numpy(self.poses_lidar)  # [N, 4, 4]

        if self.images_lidar is not None:
            self.images_lidar = torch.from_numpy(
                np.stack(self.images_lidar, axis=0)
            ).float()  # [N, H, W, C]

        # calculate mean radius of all camera poses
        # self.radius = self.poses[:, :3, 3].norm(dim=-1).mean(0).item()
        # print(f'[INFO] dataset camera poses: radius = {self.radius:.4f}, bound = {self.bound}')

        # [debug] uncomment to view all training poses.
        # visualize_poses(self.poses.numpy())

        if self.preload:
            self.poses_lidar = self.poses_lidar.to(self.device)
            if self.images_lidar is not None:
                # TODO: linear use pow, but pow for half is only available for torch >= 1.10 ?
                if self.fp16:
                    dtype = torch.half
                else:
                    dtype = torch.float
                self.images_lidar = self.images_lidar.to(dtype).to(self.device)

        self.intrinsics_lidar = (2.0, 26.9)  # fov_up, fov

    def collate(self, index):
        B = len(index)  # a list of length 1

        results = {}

        if self.enable_lidar:
            poses_lidar = self.poses_lidar[index].to(self.device)  # [B, 4, 4]
            rays_lidar = get_lidar_rays(
                poses_lidar,
                self.intrinsics_lidar,
                self.H_lidar,
                self.W_lidar,
                self.num_rays_lidar,
                self.patch_size_lidar,
            )

            results.update(
                {
                    "H_lidar": self.H_lidar,
                    "W_lidar": self.W_lidar,
                    "rays_o_lidar": rays_lidar["rays_o"],
                    "rays_d_lidar": rays_lidar["rays_d"],
                }
            )

        if self.images_lidar is not None and self.enable_lidar:
            images_lidar = self.images_lidar[index].to(self.device)  # [B, H, W, 3/4]
            if self.training:
                C = images_lidar.shape[-1]
                images_lidar = torch.gather(
                    images_lidar.view(B, -1, C),
                    1,
                    torch.stack(C * [rays_lidar["inds"]], -1),
                )  # [B, N, 3/4]
            results["images_lidar"] = images_lidar

        return results

    def dataloader(self):
        size = len(self.poses_lidar)
        loader = DataLoader(
            list(range(size)),
            batch_size=1,
            collate_fn=self.collate,
            shuffle=self.training,
            num_workers=0,
        )
        loader._data = self
        loader.has_gt = self.images_lidar is not None
        return loader

    def __len__(self):
        """
        Returns # of frames in this dataset.
        """
        num_frames = len(self.poses_lidar)
        return num_frames


================================================
FILE: lidarnerf/dataset/nerfmvl_dataset.py
================================================
import os
import json
import tqdm
import numpy as np

import torch
from torch.utils.data import DataLoader
from dataclasses import dataclass, field

from lidarnerf.dataset.base_dataset import get_lidar_rays, BaseDataset


@dataclass
class NeRFMVLDataset(BaseDataset):
    device: str = "cpu"
    split: str = "train"  # train, val, test
    root_path: str = "data/kitti360"
    sequence_id: str = "car"
    preload: bool = True  # preload data into GPU
    # scale applied to poses and depths so the sensor stays inside the bounding box.
    scale: float = 1.0
    offset: list = field(default_factory=list)  # offset
    # bound = opt.bound  # bounding box half length, also used as the radius to random sample poses.
    fp16: bool = True  # if preload, load into fp16.
    patch_size: int = 1  # size of the image to extract from the scene.
    patch_size_lidar: int = 1  # size of the image to extract from the Lidar.
    enable_lidar: bool = True
    num_rays: int = 4096
    num_rays_lidar: int = 4096

    def __post_init__(self):
        self.class_name = self.sequence_id
        self.training = self.split in ["train", "all", "trainval"]
        self.testing = self.split in ["test"]
        self.num_rays = self.num_rays if self.training else -1
        self.num_rays_lidar = self.num_rays_lidar if self.training else -1

        with open(
            os.path.join(
                self.root_path, f"transforms_{self.class_name}_{self.split}.json"
            ),
            "r",
        ) as f:
            transform = json.load(f)

        if "h_lidar" in transform and "w_lidar" in transform:
            self.H_lidar = int(transform["h_lidar"])
            self.W_lidar = int(transform["w_lidar"])

        # read images
        frames = transform["frames"]
        # frames = sorted(frames, key=lambda d: d['file_path']) # why do I sort...

        self.poses_lidar = []
        self.images_lidar = []
        for f in tqdm.tqdm(frames, desc=f"Loading {self.split} data"):
            pose_lidar = np.array(f["lidar2world"], dtype=np.float32)  # [4, 4]
            self.poses_lidar.append(pose_lidar)
            if "lidar_file_path" in f.keys():
                f_lidar_path = os.path.join(self.root_path, f["lidar_file_path"])
                # pc[..., 0]: unused, pc[..., 1]: intensity, pc[..., 2]: depth
                pc = np.load(f_lidar_path)["data"]

                # ray_drop = np.where(pc.reshape(-1, 3)[:, 2] == 0.0, 0.0,
                #                     1.0).reshape(self.H_lidar, self.W_lidar, 1)
                ray_drop = pc.reshape(-1, 3)[:, 2].copy()
                ray_drop[ray_drop > 0] = 1.0
                ray_drop = ray_drop.reshape(self.H_lidar, self.W_lidar, 1)
                image_lidar = np.concatenate(
                    [ray_drop, pc[:, :, 1, None], pc[:, :, 2, None] * self.scale],
                    axis=-1,
                )

                self.images_lidar.append(image_lidar)
            else:
                self.images_lidar = None

        dataset_bbox = np.load(
            os.path.join(self.root_path, "dataset_bbox_7k.npy"), allow_pickle=True
        ).item()
        self.OBB = dataset_bbox[self.class_name]

        self.offset = np.mean(self.OBB, axis=0)

        self.poses_lidar = np.stack(self.poses_lidar, axis=0)
        self.poses_lidar_wo_scale_offset = self.poses_lidar.copy()
        self.OBB_pad = np.concatenate([self.OBB, np.ones((8, 1))], axis=1)
        self.OBB_local = [
            self.OBB_pad @ np.linalg.inv(pose_lidar_wo_scale_offset.reshape(4, 4)).T
            for pose_lidar_wo_scale_offset in self.poses_lidar_wo_scale_offset
        ]
        self.OBB_local = np.stack(self.OBB_local, axis=0)
        self.poses_lidar[:, :3, -1] = (
            self.poses_lidar[:, :3, -1] - self.offset
        ) * self.scale
        self.poses_lidar = torch.from_numpy(self.poses_lidar)  # [N, 4, 4]

        if self.images_lidar is not None:
            self.images_lidar = torch.from_numpy(
                np.stack(self.images_lidar, axis=0)
            ).float()  # [N, H, W, C]

        if self.preload:
            self.poses_lidar = self.poses_lidar.to(self.device)
            if self.images_lidar is not None:
                # TODO: linear use pow, but pow for half is only available for torch >= 1.10 ?
                if self.fp16:
                    dtype = torch.half
                else:
                    dtype = torch.float
                self.images_lidar = self.images_lidar.to(dtype).to(self.device)

        self.intrinsics_lidar = (15, 40)  # fov_up, fov

    def collate(self, index):
        B = len(index)  # a list of length 1

        results = {}
        if self.enable_lidar:
            poses_lidar = self.poses_lidar[index].to(self.device)  # [B, 4, 4]
            rays_lidar = get_lidar_rays(
                poses_lidar,
                self.intrinsics_lidar,
                self.H_lidar,
                self.W_lidar,
                -1,
                self.patch_size_lidar,
            )

            results.update(
                {
                    "H_lidar": self.H_lidar,
                    "W_lidar": self.W_lidar,
                    "rays_o_lidar": rays_lidar["rays_o"],
                    "rays_d_lidar": rays_lidar["rays_d"],
                }
            )

        if self.testing:
            results["OBB_local"] = self.OBB_local[index].reshape(8, 4)

        if self.images_lidar is not None and self.enable_lidar:
            images_lidar = self.images_lidar[index].to(self.device)  # [B, H, W, 3/4]
            if self.training:
                C = images_lidar.shape[-1]
                images_lidar = torch.gather(
                    images_lidar.view(B, -1, C),
                    1,
                    torch.stack(C * [rays_lidar["inds"]], -1),
                )  # [B, N, 3/4]
                mask = images_lidar[:, :, 0] > -1
                results["images_lidar"] = images_lidar[mask].view(B, -1, C)
                results["rays_o_lidar"] = results["rays_o_lidar"][mask].view(B, -1, 3)
                results["rays_d_lidar"] = results["rays_d_lidar"][mask].view(B, -1, 3)
                valid_num_rays = results["rays_o_lidar"].shape[1]
                if valid_num_rays > self.num_rays_lidar:
                    # mask_inds = torch.randint(0, valid_num_rays, size=[self.num_rays_lidar], device=self.device)
                    mask_inds = torch.randperm(
                        valid_num_rays, device=images_lidar.device
                    )[: self.num_rays_lidar]
                    results["images_lidar"] = results["images_lidar"][
                        :, mask_inds, :
                    ].view(B, -1, C)
                    results["rays_o_lidar"] = results["rays_o_lidar"][
                        :, mask_inds, :
                    ].view(B, -1, 3)
                    results["rays_d_lidar"] = results["rays_d_lidar"][
                        :, mask_inds, :
                    ].view(B, -1, 3)
            else:
                results["images_lidar"] = images_lidar

        return results

    def dataloader(self):
        size = len(self.poses_lidar)
        loader = DataLoader(
            list(range(size)),
            batch_size=1,
            collate_fn=self.collate,
            shuffle=self.training,
            num_workers=0,
        )
        loader._data = self
        loader.has_gt = self.images_lidar is not None
        return loader

    def __len__(self):
        """
        Returns # of frames in this dataset.
        """
        num_frames = len(self.poses_lidar)
        return num_frames


================================================
FILE: lidarnerf/encoding.py
================================================
import torch
import torch.nn as nn
import numpy as np


class FreqEncoder(nn.Module):
    def __init__(
        self,
        input_dim,
        max_freq_log2,
        N_freqs,
        log_sampling=True,
        include_input=True,
        periodic_fns=(torch.sin, torch.cos),
    ):
        super().__init__()

        self.input_dim = input_dim
        self.include_input = include_input
        self.periodic_fns = periodic_fns

        self.output_dim = 0
        if self.include_input:
            self.output_dim += self.input_dim

        self.output_dim += self.input_dim * N_freqs * len(self.periodic_fns)

        if log_sampling:
            self.freq_bands = 2.0 ** torch.linspace(0.0, max_freq_log2, N_freqs)
        else:
            self.freq_bands = torch.linspace(2.0**0.0, 2.0**max_freq_log2, N_freqs)

        self.freq_bands = self.freq_bands.numpy().tolist()

    def forward(self, input, **kwargs):
        out = []
        if self.include_input:
            out.append(input)

        for i in range(len(self.freq_bands)):
            freq = self.freq_bands[i]
            for p_fn in self.periodic_fns:
                out.append(p_fn(input * freq))

        out = torch.cat(out, dim=-1)

        return out


def get_encoder(
    encoding,
    input_dim=3,
    multires=6,
    degree=4,
    num_levels=16,
    level_dim=2,
    base_resolution=16,
    log2_hashmap_size=19,
    desired_resolution=2048,
    align_corners=False,
    **kwargs
):
    if encoding == "None":
        return lambda x, **kwargs: x, input_dim

    elif encoding == "frequency":
        # encoder = FreqEncoder(input_dim=input_dim, max_freq_log2=multires-1, N_freqs=multires, log_sampling=True)
        from freqencoder import FreqEncoder

        encoder = FreqEncoder(input_dim=input_dim, degree=multires)

    elif encoding == "sphere_harmonics":
        from shencoder import SHEncoder

        encoder = SHEncoder(input_dim=input_dim, degree=degree)

    elif encoding == "hashgrid":
        from gridencoder import GridEncoder

        encoder = GridEncoder(
            input_dim=input_dim,
            num_levels=num_levels,
            level_dim=level_dim,
            base_resolution=base_resolution,
            log2_hashmap_size=log2_hashmap_size,
            desired_resolution=desired_resolution,
            gridtype="hash",
            align_corners=align_corners,
        )

    elif encoding == "tiledgrid":
        from gridencoder import GridEncoder

        encoder = GridEncoder(
            input_dim=input_dim,
            num_levels=num_levels,
            level_dim=level_dim,
            base_resolution=base_resolution,
            log2_hashmap_size=log2_hashmap_size,
            desired_resolution=desired_resolution,
            gridtype="tiled",
            align_corners=align_corners,
        )

    elif encoding == "ash":
        from ashencoder import AshEncoder

        encoder = AshEncoder(
            input_dim=input_dim,
            output_dim=16,
            log2_hashmap_size=log2_hashmap_size,
            resolution=desired_resolution,
        )

    else:
        raise NotImplementedError(
            "Unknown encoding mode, choose from [None, frequency, sphere_harmonics, hashgrid, tiledgrid, ash]"
        )

    return encoder, encoder.output_dim


class PeriodicVolumeEncoding(nn.Module):
    """Periodic Volume encoding

    Args:
        num_levels: Number of feature grids.
        min_res: Resolution of smallest feature grid.
        max_res: Resolution of largest feature grid.
        log2_hashmap_size: Size of hash map is 2^log2_hashmap_size.
        features_per_level: Number of features per level.
        hash_init_scale: Value to initialize hash grid.
        smoothstep: Whether to apply smoothstep to the interpolation weights.
    """

    def __init__(
        self,
        num_levels: int = 16,
        min_res: int = 16,
        max_res: int = 1024,
        log2_hashmap_size: int = 19,
        features_per_level: int = 2,
        hash_init_scale: float = 0.001,
        smoothstep: bool = False,
    ) -> None:
        super(PeriodicVolumeEncoding, self).__init__()
        self.in_dim = 3
        self.num_levels = num_levels
        self.features_per_level = features_per_level
        self.log2_hashmap_size = log2_hashmap_size
        assert (
            log2_hashmap_size % 3 == 0
        ), f"log2_hashmap_size must be divisible by 3, but got {log2_hashmap_size}"
        self.hash_table_size = 2**log2_hashmap_size
        self.n_output_dims = num_levels * features_per_level
        self.smoothstep = smoothstep

        levels = torch.arange(num_levels)
        growth_factor = np.exp((np.log(max_res) - np.log(min_res)) / (num_levels - 1))
        self.scalings = torch.floor(min_res * growth_factor**levels)

        self.periodic_volume_resolution = 2 ** (log2_hashmap_size // 3)
        # self.periodic_resolution = torch.minimum(torch.floor(self.scalings), periodic_volume_resolution)

        self.hash_offset = levels * self.hash_table_size
        self.hash_table = (
            torch.rand(size=(self.hash_table_size * num_levels, features_per_level)) * 2
            - 1
        )
        self.hash_table *= hash_init_scale
        self.hash_table = nn.Parameter(self.hash_table)

        # TODO weight loss by level?
        self.per_level_weights = 1.0

    def parameters(self):
        return self.hash_table

    def get_out_dim(self) -> int:
        return self.num_levels * self.features_per_level

    def hash_fn(self, in_tensor):
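        """Returns flat hash-table indices for integer grid coordinates.

        Coordinates are wrapped modulo the periodic volume resolution and then
        linearized, so the mapping is periodic rather than a spatial hash.

        Args:
            in_tensor: Integer grid coordinates of shape [..., num_levels, 3].
        """

        # wrap coordinates to make the volume periodic (out-of-place, so the
        # caller's tensor is not modified)
        x = in_tensor % self.periodic_volume_resolution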
        # xyz to index
        x = (
            x[..., 0] * (self.periodic_volume_resolution**2)
            + x[..., 1] * (self.periodic_volume_resolution)
            + x[..., 2]
        )
        # offset by feature levels
        x += self.hash_offset.to(x.device)

        return x.long()

    def pytorch_fwd(self, in_tensor):
        """Forward pass using pytorch. Significantly slower than TCNN implementation."""

        assert in_tensor.shape[-1] == 3
        in_tensor = in_tensor[..., None, :]  # [..., 1, 3]
        scaled = in_tensor * self.scalings.view(-1, 1).to(
            in_tensor.device
        )  # [..., L, 3]
        scaled_c = torch.ceil(scaled).type(torch.int32)
        scaled_f = torch.floor(scaled).type(torch.int32)

        offset = scaled - scaled_f

        if self.smoothstep:
            offset = offset * offset * (3.0 - 2.0 * offset)

        hashed_0 = self.hash_fn(scaled_c)  # [..., num_levels]
        hashed_1 = self.hash_fn(
            torch.cat(
                [scaled_c[..., 0:1], scaled_f[..., 1:2], scaled_c[..., 2:3]], dim=-1
            )
        )
        hashed_2 = self.hash_fn(
            torch.cat(
                [scaled_f[..., 0:1], scaled_f[..., 1:2], scaled_c[..., 2:3]], dim=-1
            )
        )
        hashed_3 = self.hash_fn(
            torch.cat(
                [scaled_f[..., 0:1], scaled_c[..., 1:2], scaled_c[..., 2:3]], dim=-1
            )
        )
        hashed_4 = self.hash_fn(
            torch.cat(
                [scaled_c[..., 0:1], scaled_c[..., 1:2], scaled_f[..., 2:3]], dim=-1
            )
        )
        hashed_5 = self.hash_fn(
            torch.cat(
                [scaled_c[..., 0:1], scaled_f[..., 1:2], scaled_f[..., 2:3]], dim=-1
            )
        )
        hashed_6 = self.hash_fn(scaled_f)
        hashed_7 = self.hash_fn(
            torch.cat(
                [scaled_f[..., 0:1], scaled_c[..., 1:2], scaled_f[..., 2:3]], dim=-1
            )
        )

        f_0 = self.hash_table[hashed_0]  # [..., num_levels, features_per_level]
        f_1 = self.hash_table[hashed_1]
        f_2 = self.hash_table[hashed_2]
        f_3 = self.hash_table[hashed_3]
        f_4 = self.hash_table[hashed_4]
        f_5 = self.hash_table[hashed_5]
        f_6 = self.hash_table[hashed_6]
        f_7 = self.hash_table[hashed_7]

        f_03 = f_0 * offset[..., 0:1] + f_3 * (1 - offset[..., 0:1])
        f_12 = f_1 * offset[..., 0:1] + f_2 * (1 - offset[..., 0:1])
        f_56 = f_5 * offset[..., 0:1] + f_6 * (1 - offset[..., 0:1])
        f_47 = f_4 * offset[..., 0:1] + f_7 * (1 - offset[..., 0:1])

        f0312 = f_03 * offset[..., 1:2] + f_12 * (1 - offset[..., 1:2])
        f4756 = f_47 * offset[..., 1:2] + f_56 * (1 - offset[..., 1:2])

        encoded_value = f0312 * offset[..., 2:3] + f4756 * (
            1 - offset[..., 2:3]
        )  # [..., num_levels, features_per_level]

        return torch.flatten(
            encoded_value, start_dim=-2, end_dim=-1
        )  # [..., num_levels * features_per_level]

    def forward(self, in_tensor):
        return self.pytorch_fwd(in_tensor)

    def get_total_variation_loss(self):
        """Compute the total variation loss for the feature volume."""
        feature_volume = self.hash_table.reshape(
            self.num_levels,
            self.periodic_volume_resolution,
            self.periodic_volume_resolution,
            self.periodic_volume_resolution,
            self.features_per_level,
        )
        diffx = feature_volume[:, 1:, :, :, :] - feature_volume[:, :-1, :, :, :]
        diffy = feature_volume[:, :, 1:, :, :] - feature_volume[:, :, :-1, :, :]
        diffz = feature_volume[:, :, :, 1:, :] - feature_volume[:, :, :, :-1, :]

        # TODO how to sum here or should we use mask?
        resx = diffx.abs().mean(dim=(1, 2, 3, 4))
        resy = diffy.abs().mean(dim=(1, 2, 3, 4))
        resz = diffz.abs().mean(dim=(1, 2, 3, 4))

        return ((resx + resy + resz) * self.per_level_weights).mean()
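

# Illustrative sketch, not part of the original module: a pure-Python version of
# the index arithmetic in PeriodicVolumeEncoding.hash_fn for a single grid
# vertex. The helper name `periodic_index_sketch` and its `level` argument are
# hypothetical; it assumes log2_hashmap_size is divisible by 3.
def periodic_index_sketch(ix, iy, iz, log2_hashmap_size=18, level=0):
    res = 2 ** (log2_hashmap_size // 3)  # side length of the periodic volume
    ix, iy, iz = ix % res, iy % res, iz % res  # wrap each axis
    flat = ix * res**2 + iy * res + iz  # xyz -> linear index within one level
    return level * 2**log2_hashmap_size + flat  # shift into this level's table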


================================================
FILE: lidarnerf/ffmlp/__init__.py
================================================


================================================
FILE: lidarnerf/ffmlp/backend.py
================================================
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))

nvcc_flags = [
    "-O3",
    "-std=c++14",
    "--expt-extended-lambda",
    "--expt-relaxed-constexpr",
    "-U__CUDA_NO_HALF_OPERATORS__",
    "-U__CUDA_NO_HALF_CONVERSIONS__",
    "-U__CUDA_NO_HALF2_OPERATORS__",
]

if os.name == "posix":
    nvcc_flags += [
        "-Xcompiler=-mf16c",
        "-Xcompiler=-Wno-float-conversion",
        "-Xcompiler=-fno-strict-aliasing",
    ]
    c_flags = ["-O3", "-std=c++14"]
elif os.name == "nt":
    c_flags = ["/O2", "/std:c++17"]

    # find cl.exe
    def find_cl_path():
        import glob

        for edition in ["Enterprise", "Professional", "BuildTools", "Community"]:
            paths = sorted(
                glob.glob(
                    r"C:\\Program Files (x86)\\Microsoft Visual Studio\\*\\%s\\VC\\Tools\\MSVC\\*\\bin\\Hostx64\\x64"
                    % edition
                ),
                reverse=True,
            )
            if paths:
                return paths[0]

    # If cl.exe is not on path, try to find it.
    if os.system("where cl.exe >nul 2>nul") != 0:
        cl_path = find_cl_path()
        if cl_path is None:
            raise RuntimeError(
                "Could not locate a supported Microsoft Visual C++ installation"
            )
        os.environ["PATH"] += ";" + cl_path

_backend = load(
    name="_ffmlp",
    extra_cflags=c_flags,
    extra_cuda_cflags=nvcc_flags,
    extra_include_paths=[
        os.path.join(_src_path, "dependencies/cutlass/include"),
        os.path.join(_src_path, "dependencies/cutlass/tools/util/include"),
    ],
    sources=[
        os.path.join(_src_path, "src", f)
        for f in [
            "ffmlp.cu",
            "bindings.cpp",
        ]
    ],
)

__all__ = ["_backend"]


================================================
FILE: lidarnerf/ffmlp/ffmlp.py
================================================
import math

import torch
import torch.nn as nn
from torch.autograd import Function
from torch.cuda.amp import custom_bwd, custom_fwd

try:
    import _ffmlp as _backend
except ImportError:
    from .backend import _backend


class _ffmlp_forward(Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.half)
    def forward(
        ctx,
        inputs,
        weights,
        input_dim,
        output_dim,
        hidden_dim,
        num_layers,
        activation,
        output_activation,
        inference=False,
        calc_grad_inputs=False,
    ):
        B = inputs.shape[0]

        inputs = inputs.contiguous()
        weights = weights.contiguous()

        # print('[inputs]', torch.any(torch.isnan(inputs)), inputs.shape, inputs.dtype, inputs.min().item(), inputs.max().item())
        # print('[weights]', torch.any(torch.isnan(weights)), weights.shape, weights.dtype, weights.min().item(), weights.max().item())

        # allocate output
        outputs = torch.empty(B, output_dim, device=inputs.device, dtype=inputs.dtype)

        if not inference:
            forward_buffer = torch.empty(
                num_layers, B, hidden_dim, device=inputs.device, dtype=inputs.dtype
            )
            _backend.ffmlp_forward(
                inputs,
                weights,
                B,
                input_dim,
                output_dim,
                hidden_dim,
                num_layers,
                activation,
                output_activation,
                forward_buffer,
                outputs,
            )
            ctx.save_for_backward(inputs, weights, outputs, forward_buffer)
            ctx.dims = (
                input_dim,
                output_dim,
                hidden_dim,
                num_layers,
                activation,
                output_activation,
                calc_grad_inputs,
            )

            # print('[outputs]', torch.any(torch.isnan(outputs)), outputs.shape, outputs.dtype, outputs.min().item(), outputs.max().item())
            # print('[forward_buffer]', torch.any(torch.isnan(forward_buffer)), forward_buffer.shape, forward_buffer.dtype, forward_buffer.min().item(), forward_buffer.max().item())
        else:
            inference_buffer = torch.empty(
                B, hidden_dim, device=inputs.device, dtype=inputs.dtype
            )
            _backend.ffmlp_inference(
                inputs,
                weights,
                B,
                input_dim,
                output_dim,
                hidden_dim,
                num_layers,
                activation,
                output_activation,
                inference_buffer,
                outputs,
            )

            # print('[outputs]', torch.any(torch.isnan(outputs)), outputs.shape, outputs.dtype, outputs.min().item(), outputs.max().item())
            # print('[inference_buffer]', torch.any(torch.isnan(inference_buffer)), inference_buffer.shape, inference_buffer.dtype, inference_buffer.min().item(), inference_buffer.max().item())

        return outputs

    @staticmethod
    @custom_bwd
    def backward(ctx, grad):
        # grad: [B, output_dim]

        B = grad.shape[0]

        grad = grad.contiguous()

        # print('[grad]', torch.any(torch.isnan(grad)), grad.shape, grad.dtype, grad.min().item(), grad.max().item())
        # print(grad)

        inputs, weights, outputs, forward_buffer = ctx.saved_tensors

        (
            input_dim,
            output_dim,
            hidden_dim,
            num_layers,
            activation,
            output_activation,
            calc_grad_inputs,
        ) = ctx.dims

        # allocate outputs
        if calc_grad_inputs:
            grad_inputs = torch.zeros_like(inputs)
        else:
            grad_inputs = torch.zeros(1, device=grad.device, dtype=grad.dtype)  # dummy

        grad_weights = torch.zeros_like(weights)
        backward_buffer = torch.zeros(
            num_layers, B, hidden_dim, device=grad.device, dtype=grad.dtype
        )

        _backend.ffmlp_backward(
            grad,
            inputs,
            weights,
            forward_buffer,
            B,
            input_dim,
            output_dim,
            hidden_dim,
            num_layers,
            activation,
            output_activation,
            calc_grad_inputs,
            backward_buffer,
            grad_inputs,
            grad_weights,
        )

        # print('[grad_inputs]', grad_inputs.shape, grad_inputs.dtype, grad_inputs.min().item(), grad_inputs.max().item())
        # print('[grad_weights]', grad_weights.shape, grad_weights.dtype, grad_weights.min().item(), grad_weights.max().item())
        # print('[backward_buffer]', backward_buffer.shape, backward_buffer.dtype, backward_buffer.min().item(), backward_buffer.max().item())
        if calc_grad_inputs:
            return (
                grad_inputs,
                grad_weights,
                None,
                None,
                None,
                None,
                None,
                None,
                None,
                None,
            )
        else:
            return None, grad_weights, None, None, None, None, None, None, None, None


ffmlp_forward = _ffmlp_forward.apply


def convert_activation(act):
    if act == "relu":
        return 0
    elif act == "exponential":
        return 1
    elif act == "sine":
        return 2
    elif act == "sigmoid":
        return 3
    elif act == "squareplus":
        return 4
    elif act == "softplus":
        return 5
    else:
        return 6


class FFMLP(nn.Module):
    def __init__(
        self, input_dim, output_dim, hidden_dim, num_layers, activation="relu"
    ):
        super().__init__()

        self.input_dim = input_dim
        self.output_dim = output_dim
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.activation = convert_activation(activation)
        self.output_activation = convert_activation("none")  # not supported currently

        self.tensorcore_width = 16

        assert hidden_dim in [
            16,
            32,
            64,
            128,
            256,
        ], f"FFMLP only supports hidden_dim in [16, 32, 64, 128, 256], but got {hidden_dim}"
        assert (
            input_dim > 0 and input_dim % 16 == 0
        ), f"FFMLP input_dim should be 16 * m (m > 0), but got {input_dim}"
        assert (
            output_dim <= 16
        ), f"FFMLP currently only supports output_dim <= 16, but got {output_dim}"
        assert (
            num_layers >= 2
        ), f"FFMLP num_layers should be at least 2 (3 matmuls), but got {num_layers}"

        # pad output
        self.padded_output_dim = int(math.ceil(output_dim / 16)) * 16

        # parameters (continuous in memory)
        self.num_parameters = hidden_dim * (
            input_dim + hidden_dim * (num_layers - 1) + self.padded_output_dim
        )
        self.weights = nn.Parameter(torch.zeros(self.num_parameters))
        self.reset_parameters()

        # allocate streams
        _backend.allocate_splitk(self.num_layers + 1)

        # register destructor
        # atexit.register(self.cleanup) # how to correctly clean? this gives CUDA Error: cudaEventDestroy(events[i]) failed with error context is destroyed

    def cleanup(self):
        # destroy streams
        _backend.free_splitk()

    def __repr__(self):
        return f"FFMLP: input_dim={self.input_dim} output_dim={self.output_dim} hidden_dim={self.hidden_dim} num_layers={self.num_layers} activation={self.activation}"

    def reset_parameters(self):
        torch.manual_seed(42)
        std = math.sqrt(3 / self.hidden_dim)
        self.weights.data.uniform_(-std, std)

    def forward(self, inputs):
        # inputs: [B, input_dim]
        # return: [B, output_dim]

        # print('inputs', inputs.shape, inputs.dtype, inputs.min().item(), inputs.max().item(), inputs.requires_grad)

        B, C = inputs.shape
        # assert B >= 128 and B % 128 == 0, f"ffmlp batch size must be 128 * m (m > 0), but got {B}."

        # pad the batch up to the next multiple of 128 (no-op if already aligned)
        pad = (128 - B % 128) % 128
        if pad > 0:
            inputs = torch.cat(
                [inputs, torch.zeros(pad, C, dtype=inputs.dtype, device=inputs.device)],
                dim=0,
            )

        outputs = ffmlp_forward(
            inputs,
            self.weights,
            self.input_dim,
            self.padded_output_dim,
            self.hidden_dim,
            self.num_layers,
            self.activation,
            self.output_activation,
            not self.training,
            inputs.requires_grad,
        )

        # unpad output
        if B != outputs.shape[0] or self.padded_output_dim != self.output_dim:
            outputs = outputs[:B, : self.output_dim]

        # print('outputs', outputs.shape, outputs.dtype, outputs.min().item(), outputs.max().item())

        return outputs
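

# Illustrative sketch, not part of the original module: the batch rounding that
# FFMLP.forward relies on, written with plain integers. The helper name
# `padded_batch_size_sketch` is hypothetical, and this sketch assumes a batch
# that is already a multiple of 128 needs no extra rows.
def padded_batch_size_sketch(batch_size, multiple=128):
    pad = (multiple - batch_size % multiple) % multiple  # zero rows to append
    return batch_size + pad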


================================================
FILE: lidarnerf/ffmlp/setup.py
================================================
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

_src_path = os.path.dirname(os.path.abspath(__file__))

nvcc_flags = [
    "-O3",
    "-std=c++14",
    "--expt-extended-lambda",
    "--expt-relaxed-constexpr",
    "-U__CUDA_NO_HALF_OPERATORS__",
    "-U__CUDA_NO_HALF_CONVERSIONS__",
    "-U__CUDA_NO_HALF2_OPERATORS__",
]

if os.name == "posix":
    nvcc_flags += [
        "-Xcompiler=-mf16c",
        "-Xcompiler=-Wno-float-conversion",
        "-Xcompiler=-fno-strict-aliasing",
    ]
    c_flags = ["-O3", "-std=c++14"]
elif os.name == "nt":
    c_flags = ["/O2", "/std:c++17"]

    # find cl.exe
    def find_cl_path():
        import glob

        for edition in ["Enterprise", "Professional", "BuildTools", "Community"]:
            paths = sorted(
                glob.glob(
                    r"C:\\Program Files (x86)\\Microsoft Visual Studio\\*\\%s\\VC\\Tools\\MSVC\\*\\bin\\Hostx64\\x64"
                    % edition
                ),
                reverse=True,
            )
            if paths:
                return paths[0]

    # If cl.exe is not on path, try to find it.
    if os.system("where cl.exe >nul 2>nul") != 0:
        cl_path = find_cl_path()
        if cl_path is None:
            raise RuntimeError(
                "Could not locate a supported Microsoft Visual C++ installation"
            )
        os.environ["PATH"] += ";" + cl_path

setup(
    name="ffmlp",  # package name, import this to use python API
    ext_modules=[
        CUDAExtension(
            name="_ffmlp",  # extension name, import this to use CUDA API
            sources=[
                os.path.join(_src_path, "src", f)
                for f in [
                    "ffmlp.cu",
                    "bindings.cpp",
                ]
            ],
            extra_compile_args={
                "cxx": c_flags,
                "nvcc": nvcc_flags,
            },
            include_dirs=[
                os.path.join(_src_path, "dependencies/cutlass/include"),
                os.path.join(_src_path, "dependencies/cutlass/tools/util/include"),
            ],
        ),
    ],
    cmdclass={
        "build_ext": BuildExtension,
    },
)


================================================
FILE: lidarnerf/ffmlp/src/bindings.cpp
================================================
#include <torch/extension.h>

#include "ffmlp.h"

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("ffmlp_forward", &ffmlp_forward, "ffmlp_forward (CUDA)");
    m.def("ffmlp_inference", &ffmlp_inference, "ffmlp_inference (CUDA)");
    m.def("ffmlp_backward", &ffmlp_backward, "ffmlp_backward (CUDA)");
    m.def("allocate_splitk", &allocate_splitk, "allocate_splitk (CUDA)");
    m.def("free_splitk", &free_splitk, "free_splitk (CUDA)");
}

================================================
FILE: lidarnerf/ffmlp/src/cutlass_matmul.h
================================================
/*
 * Copyright (c) 2020-2022, NVIDIA CORPORATION.  All rights reserved.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *     * Redistributions of source code must retain the above copyright notice,
 * this list of conditions and the following disclaimer.
 *     * Redistributions in binary form must reproduce the above copyright
 * notice, this list of conditions and the following disclaimer in the
 * documentation and/or other materials provided with the distribution.
 *     * Neither the name of the NVIDIA CORPORATION nor the names of its
 * contributors may be used to endorse or promote products derived from this
 * software without specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL NVIDIA CORPORATION BE LIABLE FOR ANY
 * DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
 * (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
 * LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
 * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
 * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
 * SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 *//*
 */

/** @file   cutlass_matmul.h
 *  @author Thomas Müller, NVIDIA
 *  @brief  Matrix multiplication wrappers that call into CUTLASS (plus some
 * custom modifications). The parameters are optimized to give optimal
 * performance in a variety of situations. Parts of this file were adapted by
 * starting from the CUTLASS sample code (see its BSD 3-clause license).
 */

#pragma once

#include <cutlass/array.h>
#include <cutlass/cutlass.h>
#include <cutlass/functional.h>
#include <cutlass/gemm/device/gemm.h>
#include <cutlass/gemm/device/gemm_splitk_parallel.h>
#include <cutlass/numeric_conversion.h>
#include <cutlass/numeric_types.h>
#include <torch/torch.h>

#include <iostream>
#include <map>
#include <type_traits>

#include "utils.h"

//#define TCNN_VERBOSE_MEMORY_ALLOCS

#define CUTLASS_CHECK(status)                                                 \
    {                                                                         \
        cutlass::Status error = status;                                       \
        if (error != cutlass::Status::kSuccess) {                             \
            std::cerr << "Got cutlass error: "                                \
                      << cutlassGetStatusString(error) << " at: " << __LINE__ \
                      << std::endl;                                           \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    }

#define CUDA_CHECK(status)                                                    \
    {                                                                         \
        cudaError_t error = status;                                           \
        if (error != cudaSuccess) {                                           \
            std::cerr << "Got bad cuda status: " << cudaGetErrorString(error) \
                      << " at line: " << __LINE__ << std::endl;               \
            exit(EXIT_FAILURE);                                               \
        }                                                                     \
    }

using SmArch = std::conditional_t<
        MIN_GPU_ARCH >= 80,
        std::conditional_t<std::is_same<network_precision_t, float>::value,
                           cutlass::arch::Sm75,
                           cutlass::arch::Sm80>,
        std::conditional_t<MIN_GPU_ARCH >= 75,
                           cutlass::arch::Sm75,
                           cutlass::arch::Sm70>>;

using TypeAccumulator =
        std::conditional_t<std::is_same<network_precision_t, float>::value,
                           float,
                           cutlass::half_t>;
using TypeCompute =
        std::conditional_t<std::is_same<network_precision_t, float>::value,
                           float,
                           cutlass::half_t>;

template <typename T>
using MMAOp = typename std::conditional<std::is_same<T, float>::value,
                                        cutlass::arch::OpClassSimt,
                                        cutlass::arch::OpClassTensorOp>::type;

template <typename T>
using ShapeMMAOp = typename std::conditional<
        std::is_same<MMAOp<T>, cutlass::arch::OpClassTensorOp>::value,
        typename std::conditional<
                std::is_same<SmArch, cutlass::arch::Sm80>::value ||
                        std::is_same<SmArch, cutlass::arch::Sm75>::value,
                cutlass::gemm::GemmShape<16, 8, 8>,
                cutlass::gemm::GemmShape<8, 8, 4>>::type,
        cutlass::gemm::GemmShape<1, 1, 1>>::type;

template <typename thread_block, typename warp>
struct LayerConfig {
    using k_thread_block = thread_block;
    using k_warp = warp;
};

using FullLayerK = typename std::conditional<
        std::is_same<MMAOp<network_precision_t>,
                     cutlass::arch::OpClassSimt>::value,
        LayerConfig<cutlass::gemm::GemmShape<128, 128, 8>,
                    cutlass::gemm::GemmShape<32, 64, 8>>,
        LayerConfig<cutlass::gemm::GemmShape<64, 64, 32>,
                    cutlass::gemm::GemmShape<32, 32, 32>>>::type;
using LastLayerK = FullLayerK;

using FullLayer = typename std::conditional<
        std::is_same<MMAOp<network_precision_t>,
                     cutlass::arch::OpClassSimt>::value,
        LayerConfig<cutlass::gemm::GemmShape<128, 128, 8>,
                    cutlass::gemm::GemmShape<32, 64, 8>>,
        LayerConfig<cutlass::gemm::GemmShape<128, 128, 32>,
                    cutlass::gemm::GemmShape<64, 64, 32>>>::type;
using LastLayer = FullLayer;

// This code section describes how threadblocks are scheduled on GPU
using SwizzleThreadBlock =
        cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>;

// This code section describes the epilogue part of the kernel

template <typename V>
struct CutlassFragmentWrapper {
    static const uint32_t num_elements = V::kElements;
    V x;
};

template <typename ElementOutput_,  ///< Data type used to load and store
                                    ///< tensors
          int Count,  ///< Number of elements computed per operation
          typename ElementAccumulator_ =
                  ElementOutput_,  ///< Accumulator data type
          typename ElementCompute_ =
                  ElementOutput_,  ///< Data type used to compute linear
                                   ///< combination
          cutlass::FloatRoundStyle Round =
                  cutlass::FloatRoundStyle::round_to_nearest>
class ActivationEpilogue {
public:
    using ElementOutput = ElementOutput_;
    using ElementAccumulator = ElementAccumulator_;
    using ElementCompute = ElementCompute_;

    static int const kCount = Count;

    using FragmentOutput = cutlass::Array<ElementOutput, kCount>;
    using FragmentAccumulator = cutlass::Array<ElementAccumulator, kCount>;
    using ComputeFragment = cutlass::Array<ElementCompute, kCount>;

    static cutlass::FloatRoundStyle const kRound = Round;

    struct Params {
        Activation activation;
        bool sum_source;
    };

public:
    CUTLASS_HOST_DEVICE
    ActivationEpilogue(Params const &params)
        : m_activation{params.activation}, m_sum_source{params.sum_source} {}

    CUTLASS_HOST_DEVICE
    bool is_source_needed() const { return m_sum_source; }

    /// Functionally required for serial reduction in the epilogue
    CUTLASS_HOST_DEVICE
    void set_k_partition(int k_partition, int k_partition_count) {}

    CUTLASS_HOST_DEVICE
    FragmentOutput operator()(FragmentAccumulator const &accumulator) const {
        cutlass::NumericArrayConverter<ElementCompute, ElementAccumulator,
                                       kCount, Round>
                accumulator_converter;

        auto intermediate = CutlassFragmentWrapper<ComputeFragment>{
                accumulator_converter(accumulator)};
        intermediate =
                warp_activation<ElementCompute>(m_activation, intermediate);

        cutlass::NumericArrayConverter<ElementOutput, ElementCompute, kCount,
                                       Round>
                destination_converter;
        return destination_converter(intermediate.x);
    }

    CUTLASS_HOST_DEVICE
    FragmentOutput operator()(FragmentAccumulator const &accumulator,
                              FragmentOutput const &source) const {
        cutlass::NumericArrayConverter<ElementCompute, ElementOutput, kCount,
                                       Round>
                source_converter;
        cutlass::NumericArrayConverter<ElementCompute, ElementAccumulator,
                                       kCount, Round>
                accumulator_converter;

        cutlass::plus<ComputeFragment> plus_op;
        auto intermediate = CutlassFragmentWrapper<ComputeFragment>{
                accumulator_converter(accumulator)};
        if (m_sum_source) {
            intermediate.x = plus_op(intermediate.x, source_converter(source));
        }
        intermediate =
                warp_activation<ElementCompute>(m_activation, intermediate);

        cutlass::NumericArrayConverter<ElementOutput, ElementCompute, kCount,
                                       Round>
                destination_converter;
        return destination_converter(intermediate.x);
    }

private:
    Activation m_activation;
    bool m_sum_source;
};

template <typename ElementOutput_,  ///< Data type used to load and store
                                    ///< tensors
          int Count,  ///< Number of elements computed per operation
          typename ElementAccumulator_ =
                  ElementOutput_,  ///< Accumulator data type
          typename ElementCompute_ =
                  ElementOutput_,  ///< Data type used to compute linear
                                   ///< combination
          cutlass::FloatRoundStyle Round =
                  cutlass::FloatRoundStyle::round_to_nearest>
class ActivationTransferEpilogue {
public:
    using ElementOutput = ElementOutput_;
    using ElementAccumulator = ElementAccumulator_;
    using ElementCompute = ElementCompute_;

    static int const kCount = Count;

    using FragmentOutput = cutlass::Array<ElementOutput, kCount>;
    using FragmentAccumulator = cutlass::Array<ElementAccumulator, kCount>;
    using ComputeFragment = cutlass::Array<ElementCompute, kCount>;

    static cutlass::FloatRoundStyle const kRound = Round;

    /// Host-constructable parameters structure
    struct Params {
        Activation activation;
    };

public:
    /// Constructs the function object, possibly loading from pointers in host
    /// memory
    CUTLASS_HOST_DEVICE
    ActivationTransferEpilogue(Params const &params)
        : m_activation{params.activation} {}

    /// Returns true if source is needed
    CUTLASS_HOST_DEVICE
    bool is_source_needed() const { return true; }

    /// Functionally required for serial reduction in the epilogue
    CUTLASS_HOST_DEVICE
    void set_k_partition(int k_partition, int k_partition_count) {}

    CUTLASS_HOST_DEVICE
    FragmentOutput operator()(FragmentAccumulator const &accumulator,
                              FragmentOutput const &source) const {
        cutlass::NumericArrayConverter<ElementCompute, ElementOutput, kCount,
                                       Round>
                source_converter;
        cutlass::NumericArrayConverter<ElementCompute, ElementAccumulator,
                                       kCount, Round>
                accumulator_converter;

        auto converted_source = CutlassFragmentWrapper<ComputeFragment>{
                source_converter(source)};
        auto intermediate = CutlassFragmentWrapper<ComputeFragment>{
                accumulator_converter(accumulator)};

        intermediate = warp_activation_backward<ElementCompute>(
                m_activation, intermediate, converted_source);

        cutlass::NumericArrayConverter<ElementOutput, ElementCompute, kCount,
                                       Round>
                destination_converter;
        return destination_converter(intermediate.x);
    }

    CUTLASS_HOST_DEVICE
    FragmentOutput operator()(FragmentAccumulator const &accumulator) const {
        cutlass::NumericArrayConverter<ElementCompute, ElementAccumulator,
                                       kCount, Round>
                accumulator_converter;

        ComputeFragment converted_accumulator =
                accumulator_converter(accumulator);

        cutlass::NumericArrayConverter<ElementOutput, ElementCompute, kCount,
                                       Round>
                destination_converter;

        return destination_converter(converted_accumulator);
    }

private:
    Activation m_activation;
};

template <typename T>
static constexpr int n_vectorized_elements =
        std::is_same<MMAOp<T>, cutlass::arch::OpClassTensorOp>::value
                ? (128 / cutlass::sizeof_bits<T>::value)
                : 1;
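// Worked example (illustrative, not from the original source): assuming
// MMAOp<cutlass::half_t> resolves to cutlass::arch::OpClassTensorOp,
// cutlass::sizeof_bits<cutlass::half_t>::value is 16, so
//     n_vectorized_elements<cutlass::half_t> == 128 / 16 == 8,
// i.e. one 128-bit access moves 8 half-precision values. Any type that is
// routed to a non-tensor-op class falls back to 1 element per access.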

template <typename T>
using SumOp =
        cutlass::epilogue::thread::LinearCombination<T,
                                                     n_vectorized_elements<T>,
                                                     TypeAccumulator,
                                                     TypeCompute>;

template <typename T>
using IntermediateActivationOp =
        ActivationEpilogue<T, 4, TypeAccumulator, TypeCompute>;

template <typename T>
using IntermediateActivationTransferOp =
        ActivationTransferEpilogue<T, 4, TypeAccumulator, TypeCompute>;

template <typename T>
using ActivationOp = ActivationEpilogue<T,
                                        n_vectorized_elements<T>,
                                        TypeAccumulator,
                                        TypeCompute>;

template <typename T>
using ActivationTransferOp =
        ActivationTransferEpilogue<T,
                                   n_vectorized_elements<T>,
                                   TypeAccumulator,
                                   TypeCompute>;

template <typename EPILOGUE,
          typename LayerConfig,
          typename TypeA,
          typename LayoutA,
          typename TypeB,
          typename LayoutB,
          typename TypeOutput,
          typename LayoutOutput>
using OurGemm =
        cutlass::gemm::device::Gemm<TypeA,
                                    LayoutA,
                                    TypeB,
                                    LayoutB,
                                    TypeOutput,
                                    LayoutOutput,
                                    TypeAccumulator,
                                    MMAOp<TypeA>,
                                    SmArch,
                                    typename LayerConfig::k_thread_block,
                                    typename LayerConfig::k_warp,
                                    ShapeMMAOp<TypeA>,
                                    EPILOGUE,
                                    SwizzleThreadBlock,
                                    2>;

template <typename EPILOGUE,
          typename LayerConfig,
          typename TypeA,
          typename LayoutA,
          typename TypeB,
          typename LayoutB,
          typename TypeOutput,
          typename LayoutOutput>
using SplitKGemm = cutlass::gemm::device::GemmSplitKParallel<
        TypeA,
        LayoutA,
        TypeB,
        LayoutB,
        TypeOutput,
        LayoutOutput,
        TypeAccumulator,
        MMAOp<TypeA>,
        SmArch,
        typename LayerConfig::k_thread_block,
        typename LayerConfig::k_warp,
        ShapeMMAOp<TypeA>,
        EPILOGUE>;

inline std::map<cudaStream_t, GPUMemory<uint8_t>> &cutlass_workspaces() {
    static std::map<cudaStream_t, GPUMemory<uint8_t>> s_workspaces;
    return s_workspaces;
}

inline uint8_t *cutlass_get_workspace(size_t size, cudaStream_t stream) {
    GPUMemory<uint8_t> &workspace = cutlass_workspaces()[stream];
    if (size > workspace.size()) {
        // Allocate twice the requested size to make sure we're not constantly
        // allocating small increments.
        size *= 2;
#ifdef TCNN_VERBOSE_MEMORY_ALLOCS
        std::cout << "CUTLASS GEMM: Allocating temporary workspace of "
                  << bytes_to_string(size) << "." << std::endl;
#endif

        workspace.resize(size);
    }
    return workspace.data();
}

inline void cutlass_free_workspace(cudaStream_t stream) {
    if (cutlass_workspaces().count(stream) == 0) {
        return;
    }

#ifdef TCNN_VERBOSE_MEMORY_ALLOCS
    std::cout << "CUTLASS GEMM: Freeing temporary workspace of "
              << bytes_to_string(cutlass_workspaces().at(stream).size()) << "."
              << std::endl;
#endif
    cutlass_workspaces().erase(stream);
}
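
// Usage sketch (illustrative, not from the original source; assumes
// GPUMemory<uint8_t>::size() reports the current allocation in bytes and
// that `stream` is a valid cudaStream_t):
//
//     uint8_t *ws = cutlass_get_workspace(1024, stream);  // resizes to 2048
//     ws = cutlass_get_workspace(1500, stream);  // 1500 <= 2048: no realloc
//     ws = cutlass_get_workspace(4096, stream);  // resizes to 8192
//     cutlass_free_workspace(stream);            // drops the per-stream entry
//
// A pointer returned earlier may be invalidated by a later call that grows
// the buffer, so callers should not cache it across requests.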

template <class Gemm>
void fc_multiply_impl(cudaStream_t stream,
                      const typename Gemm::Arguments &args) {
    // Using the arguments, query for extra workspace required for matrix
    // multiplication computation
    size_t workspace_size = Gemm::get_workspace_size(args);

    // Instantiate CUTLASS kernel depending on templates
    Gemm gemm_op;

    // Initialize CUTLASS kernel with arguments and workspace pointer
    cutlass::Status status = gemm_op.initialize(
            args, cutlass_get_workspace(workspace_size, stream), stream);
    CUTLASS_CHECK(status);

    // Launch initialized CUTLASS kernel
    status = gemm_op(stream);
    CUTLASS_CHECK(status);
}

template <class Gemm>
void fc_multiply_split_k_impl(cudaStream_t stream,
                              const typename Gemm::Arguments &args) {
    // Using the arguments, query for extra workspace required for matrix
    // multiplication computation
    size_t workspace_size = Gemm::get_workspace_size(args);

    // Instantiate CUTLASS kernel depending on templates
    Gemm gemm_op;

    // Initialize CUTLASS kernel with arguments and workspace pointer
    cutlass::Status status = gemm_op.initialize(
            args, cutlass_get_workspace(workspace_size, stream));
    CUTLASS_CHECK(status);

    // Launch initialized CUTLASS kernel
    status = gemm_op(stream);
    CUTLASS_CHECK(status);
}

//////////////////////////////////////////////////////////////////////////////////
////////////////////////////        modified ///////////////////////////////
//////////////////////////////////////////////////////////////////////////////////

template <typename config, bool RM_A, bool RM_B, bool RM_C>
void fc_multiply(cudaStream_t stream,
                 const int M,
                 const int K,
                 const int N,
                 const __half *A,
                 const __half *B,
                 const __half *C,
                 __half *D,
                 Activation act = Activation::None,
                 bool transfer = false,
                 bool sum_source = false) {
    // compute  D = A @ B + C
    // A: [M, K]
    // B: [K, N]
    // C, D: [M, N]
    using CutlassLayoutA =
            typename std::conditional<RM_A, cutlass::layout::RowMajor,
                                      cutlass::layout::ColumnMajor>::type;
    using CutlassLayoutB =
            typename std::conditional<RM_B, cutlass::layout::RowMajor,
                                      cutlass::layout::ColumnMajor>::type;
    using CutlassLayoutC =
            typename std::conditional<RM_C, cutlass::layout::RowMajor,
                                      cutlass::layout::ColumnMajor>::type;

    using MatmulTypeCompute = cutlass::half_t;
    using MatmulTypeAccumulator = cutlass::half_t;

    const int lda = RM_A ? K : M;
    const int ldb = RM_B ? N : K;
    const int ldc = RM_C ? N : M;
    const int ldd = RM_C ? N : M;

    if (transfer) {
        using Gemm =
                OurGemm<ActivationTransferOp<MatmulTypeAccumulator>, config,
                        MatmulTypeCompute, CutlassLayoutA, MatmulTypeCompute,
                        CutlassLayoutB, MatmulTypeAccumulator, CutlassLayoutC>;
        typename Gemm::Arguments arguments{{M, N, K},
                                           {(MatmulTypeCompute *)A, lda},
                                           {(MatmulTypeCompute *)B, ldb},
                                           {(MatmulTypeAccumulator *)C, ldc},
                                           {(MatmulTypeAccumulator *)D, ldd},
                                           {act},
                                           1};

        fc_multiply_impl<Gemm>(stream, arguments);
    } else {
        using Gemm =
                OurGemm<ActivationOp<MatmulTypeAccumulator>, config,
                        MatmulTypeCompute, CutlassLayoutA, MatmulTypeCompute,
                        CutlassLayoutB, MatmulTypeAccumulator, CutlassLayoutC>;
        typename Gemm::Arguments arguments{{M, N, K},
                                           {(MatmulTypeCompute *)A, lda},
                                           {(MatmulTypeCompute *)B, ldb},
                                           {(MatmulTypeAccumulator *)C, ldc},
                                           {(MatmulTypeAccumulator *)D, ldd},
                                           {act, sum_source},
                                           1};

        fc_multiply_impl<Gemm>(stream, arguments);
    }
}

template <typename config, bool RM_A, bool RM_B, bool RM_C>
void fc_multiply(cudaStream_t stream,
                 const int M,
                 const int K,
                 const int N,
                 const __half *A,
                 const __half *B,
                 __half *D,
                 Activation act = Activation::None) {
    fc_multiply<config, RM_A, RM_B, RM_C>(stream, M, K, N, A, B, D, D, act);
}
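
// Worked example (illustrative, not from the original source): leading
// dimensions for M = 256, K = 64, N = 16 with RM_A = true, RM_B = false,
// RM_C = true:
//     lda = K = 64        (A is row-major    [M, K])
//     ldb = K = 64        (B is column-major [K, N])
//     ldc = ldd = N = 16  (C and D are row-major [M, N])
// A call would then look as follows, where `MyLayerConfig` is a hypothetical
// type that provides the `k_thread_block` and `k_warp` member types used by
// OurGemm:
//
//     fc_multiply<MyLayerConfig, true, false, true>(
//             stream, 256, 64, 16, A, B, D, Activation::ReLU);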

template <typename config, bool RM_A, bool RM_B, bool RM_C>
void fc_multiply_split_k(cudaStream_t stream,
                         const int M,
                         const int K,
                         const int N,
                         const __half *A,
                         const __half *B,
                         const __half *C,
                         __half *D,
                         int split_k_slices = 1) {
    // A: [M, K]
    // B: [K, N]
    // C, D: [M, N]
    using CutlassLayoutA =
            typename std::conditional<RM_A, cutlass::layout::RowMajor,
                                      cutlass::layout::ColumnMajor>::type;
    using CutlassLayoutB =
            typename std::conditional<RM_B, cutlass::layout::RowMajor,
                                      cutlass::layout::ColumnMajor>::type;
    using CutlassLayoutC =
            typename std::conditional<RM_C, cutlass::layout::RowMajor,
                                      cutlass::layout::ColumnMajor>::type;

    using MatmulTypeCompute = cutlass::half_t;
    using MatmulTypeAccumulator = cutlass::half_t;

    const int lda = RM_A ? K : M;
    const int ldb = RM_B ? N : K;
    const int ldc = RM_C ? N : M;
    const int ldd = RM_C ? N : M;

    using Gemm =
            SplitKGemm<SumOp<MatmulTypeAccumulator>, config, MatmulTypeCompute,
                       CutlassLayoutA, MatmulTypeCompute, CutlassLayoutB,
                       MatmulTypeAccumulator, CutlassLayoutC>;

    typename Gemm::Arguments arguments{{M, N, K},
                                       {(MatmulTypeCompute *)A, lda},
                                       {(MatmulTypeCompute *)B, ldb},
                                       {(MatmulTypeAccumulator *)C, ldc},
                                       {(MatmulTypeAccumulator *)D, ldd},
                                       {(TypeCompute)1.0f, (TypeCompute)0.0f},
                                       split_k_slices};

    fc_multiply_split_k_impl<Gemm>(stream, arguments);
}

template <typename config, bool RM_A, bool RM_B, bool RM_C>
void fc_multiply_split_k(cudaStream_t stream,
                         const int M,
                         const int K,
                         const int N,
                         const __half *A,
                         const __half *B,
                         __half *D,
                         int split_k_slices = 1) {
    fc_multiply_split_k<config, RM_A, RM_B, RM_C>(stream, M, K, N, A, B, D, D,
                                                  split_k_slices);
}


================================================
FILE: lidarnerf/ffmlp/src/ffmlp.cu
================================================
#include <ATen/cuda/CUDAContext.h>
#include <cuda.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>
#include <stdint.h>
#include <torch/torch.h>

#include <algorithm>
#include <cstdio>
#include <stdexcept>
#include <vector>

#include "cutlass_matmul.h"
#include "utils.h"

__host__ __device__ Activation convert_activation(const uint32_t activation) {
    switch (activation) {
        case 0:
            return Activation::ReLU;
        case 1:
            return Activation::Exponential;
        case 2:
            return Activation::Sine;
        case 3:
            return Activation::Sigmoid;
        case 4:
            return Activation::Squareplus;
        case 5:
            return Activation::Softplus;
        case 6:
            return Activation::None;
        default:
            return Activation::None;
    }
}

template <typename T>
__host__ __device__ T div_round_up(T val, T divisor) {
    return (val + divisor - 1) / divisor;
}
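
// Worked example (illustrative): div_round_up(100u, 16u) evaluates
// (100 + 16 - 1) / 16 = 115 / 16 = 7 in integer arithmetic, while an exact
// multiple such as div_round_up(96u, 16u) gives 111 / 16 = 6. The formula
// assumes a positive divisor and that val + divisor - 1 does not overflow T.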

void check_shmem_error(cudaError_t error) {
    if (error != cudaSuccess) {
        throw std::runtime_error{
                "FullyFusedMLP: insufficient shared memory available on the "
                "GPU. Reduce `n_neurons` or use `CutlassMLP` (better "
                "compatibility but slower) instead."};
    }
}

template <int WIDTH,
          int BLOCK_DIM_Z,
          int N_ITERS,
          typename OUT_T,
          bool BACKWARD = false>
__device__ void threadblock_layer(
        Activation activation,
        __half *__restrict__ act_shmem,
        const __half *__restrict__ weights_this_layer,
        OUT_T *__restrict__ out_intermediate_threadblock_this_layer,
        const OUT_T *__restrict__ activation_aux = nullptr) {
    // act_shmem contains the intermediate activations (shared memory) of the
    // thread block's chunk of the batch. These can be forward or backward
    // activations, depending on the caller.
    // weights_this_layer points to the weight matrix of the current layer.
    // out_intermediate_threadblock_this_layer points to the location where
    // intermediate activations produced by the thread block should be
    // written. Can be nullptr if nothing should be written.
    // activation_aux points to additional arguments that the activation
    // function may depend on. Points to the hidden forward activations when
    // computing backward activations.

    constexpr uint32_t SKEW = WIDTH % 16 == 0 ? 8 : 0;
    constexpr uint32_t N_BLOCKS = WIDTH / 16;

    using namespace nvcuda;

    // If we're performing the backward pass, weights must be loaded in
    // transposed form, which is achieved by interpreting the memory in
    // row_major instead of col_major order.
    using weights_layout_t =
            std::conditional_t<BACKWARD, wmma::row_major, wmma::col_major>;

    // Fragments
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major>
            act_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, weights_layout_t>
            weights_frag[N_BLOCKS];
    wmma::fragment<wmma::accumulator, 16, 16, 16, OUT_T> result_frag[N_ITERS];

    // Indices
    const uint32_t li = threadIdx.x;  // index in warp ("lane index")
    const uint32_t wi = threadIdx.y;  // index in block ("warp index")

    const uint32_t lane_offset = (8 * li) % WIDTH;
    const uint32_t row = (8 * li + wi * 8 * 32) / WIDTH;
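
    // Worked example (illustrative, not from the original source), assuming
    // WIDTH = 64 and a block shape of (32, N_BLOCKS, BLOCK_DIM_Z) threads:
    // each thread copies 8 halfs (one int4), so a warp covers 8 * 32 = 256
    // halfs = 4 rows of 64, and the N_BLOCKS = 4 warps together cover one
    // 16-row tile. For li = 9, wi = 1:
    //     lane_offset = (8 * 9) % 64 = 8
    //     row         = (8 * 9 + 1 * 8 * 32) / 64 = 328 / 64 = 5
    // Rows in act_shmem are WIDTH + SKEW = 72 halfs apart, while rows in
    // global memory are WIDTH = 64 halfs apart.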

    const uint32_t weights_col = 16 * wi;

    __syncthreads();

// Load N_BLOCKS chunks of weights from global memory into registers.
#pragma unroll
    for (uint32_t i = 0; i < N_BLOCKS; ++i) {
        if (BACKWARD) {
            // If we're performing the backward pass, additional index swizzling
            // is needed to load the weights in transposed form.
            wmma::load_matrix_sync(
                    weights_frag[i],
                    weights_this_layer + 16 * i * WIDTH + weights_col, WIDTH);
        } else {
            wmma::load_matrix_sync(
                    weights_frag[i],
                    weights_this_layer + 16 * i + weights_col * WIDTH, WIDTH);
        }
    }

#pragma unroll
    for (int l = 0; l < N_ITERS; ++l) {
        wmma::fill_fragment(result_frag[l], 0.0f);

#pragma unroll
        for (uint32_t i = 0; i < N_BLOCKS; ++i) {
            // Load a chunk of intermediate activations from shared memory and
            // multiply with chunk of weights
            wmma::load_matrix_sync(
                    act_frag,
                    act_shmem + 16 * i +
                            (16 * (threadIdx.z + l * BLOCK_DIM_Z)) *
                                    (WIDTH + SKEW),
                    WIDTH + SKEW);
            wmma::mma_sync(result_frag[l], act_frag, weights_frag[i],
                           result_frag[l]);
        }

        // Activation
        if (BACKWARD) {
            // Load the temporary forward matrix for the relu transfer
            wmma::load_matrix_sync(
                    act_frag,
                    activation_aux + weights_col +
                            (threadIdx.z + l * BLOCK_DIM_Z) * 16 * WIDTH,
                    WIDTH);
            warp_activation_backward<__half>(activation, result_frag[l],
                                             act_frag, result_frag[l]);
        } else {
            warp_activation<__half>(activation, result_frag[l], result_frag[l]);
        }
    }

    __syncthreads();

#pragma unroll
    for (int l = 0; l < N_ITERS; ++l) {
        wmma::store_matrix_sync(
                act_shmem + weights_col +
                        (threadIdx.z + l * BLOCK_DIM_Z) * 16 * (WIDTH + SKEW),
                result_frag[l], WIDTH + SKEW, wmma::mem_row_major);
    }

    if (out_intermediate_threadblock_this_layer != nullptr) {
        __syncthreads();

#pragma unroll
        for (int l = 0; l < N_ITERS; ++l) {
            *(int4 *)&out_intermediate_threadblock_this_layer
                    [lane_offset +
                     (row + 16 * (threadIdx.z + l * BLOCK_DIM_Z)) * WIDTH] =
                    *(int4 *)&act_shmem[lane_offset +
                                        (row +
                                         16 * (threadIdx.z + l * BLOCK_DIM_Z)) *
                                                (WIDTH + SKEW)];
        }
    }
}

template <int WIDTH, int BLOCK_DIM_Z, int N_ITERS>
__device__ void threadblock_load_input_static(
        __half *__restrict__ act_shmem,
        const __half *__restrict__ input_threadblock) {
    // act_shmem is filled with the thread block's chunk of the input batch,
    // which input_threadblock points to.

    constexpr uint32_t SKEW = WIDTH % 16 == 0 ? 8 : 0;

    // Indices
    const uint32_t li = threadIdx.x;  // index in warp ("lane index")
    const uint32_t wi = threadIdx.y;  // index in block ("warp index")

    const uint32_t lane_offset = (8 * li) % WIDTH;
    const uint32_t row = (8 * li + wi * 8 * 32) / WIDTH;

#pragma unroll
    for (int i = 0; i < N_ITERS; ++i) {
        *(int4 *)&act_shmem[lane_offset +
                            (row + 16 * (threadIdx.z + i * BLOCK_DIM_Z)) *
                                    (WIDTH + SKEW)] =
                *(int4 *)&input_threadblock
                        [lane_offset +
                         (row + 16 * (threadIdx.z + i * BLOCK_DIM_Z)) * WIDTH];
    }
}

template <int WIDTH, int BLOCK_DIM_Z, int N_ITERS, typename OUT_T>
__device__ void threadblock_input_layer_forward_dynamic(
        Activation activation,
        __half *__restrict__ act_shmem,
        const __half *__restrict__ input_threadblock,
        const __half *__restrict__ weights_this_layer,
        OUT_T *__restrict__ out_intermediate_threadblock_this_layer,
        const uint32_t in_width) {
    // act_shmem contains the intermediate activations (shared memory) of the
    // thread block's chunk of the batch.
    // input_threadblock points to the thread block's chunk of the input batch
    // in global memory.
    // weights_this_layer points to the weight matrix of the current layer.
    // out_intermediate_threadblock_this_layer points to the location where
    // intermediate activations produced by the thread block should be
    // written. Can be nullptr if nothing should be written.
    // in_width is the dynamic width of the input layer.

    constexpr uint32_t SKEW = WIDTH % 16 == 0 ? 8 : 0;
    constexpr uint32_t INPUT_SKEW = 8;
    constexpr uint32_t N_BLOCKS = WIDTH / 16;

    using namespace nvcuda;

    // Fragments
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major>
            act_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major>
            weights_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, OUT_T> result_frag[N_ITERS];

    // Indices
    const uint32_t li = threadIdx.x;  // index in warp ("lane index")
    const uint32_t wi = threadIdx.y;  // index in block ("warp index")

    const uint32_t lane_offset = (8 * li) % WIDTH;
    const uint32_t row = (8 * li + wi * 8 * 32) / WIDTH;

    const uint32_t weights_col = 16 * wi;

    __half *__restrict__ weights_shmem =
            act_shmem + BLOCK_DIM_Z * 16 * (in_width + INPUT_SKEW);

    // Load input weight matrix (fits completely into shared memory)
    // Each thread can load 8 fp16 elements (16 bytes) at once; we have
    // N_BLOCKS*BLOCK_DIM_Z warps
    const uint32_t n_elems_per_load = N_BLOCKS * 32 * BLOCK_DIM_Z * 8;
    const uint32_t thread_elem_idx =
            (li + wi * 32 + threadIdx.z * N_BLOCKS * 32) * 8;

    const uint32_t n_elems_b = WIDTH * in_width;

#pragma unroll
    for (uint32_t idx = thread_elem_idx; idx < n_elems_b;
         idx += n_elems_per_load) {
        const uint32_t idx_skewed = idx + idx / in_width * INPUT_SKEW;
        *(int4 *)&weights_shmem[idx_skewed] = *(int4 *)&weights_this_layer[idx];
    }
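
    // Worked example (illustrative): with in_width = 32 and INPUT_SKEW = 8,
    // every row of 32 halfs is stored with a stride of 40 halfs in shared
    // memory. For idx = 64 (start of row 2):
    //     idx_skewed = 64 + (64 / 32) * 8 = 80 = 2 * 40
    // and for idx = 72 (row 2, column 8):
    //     idx_skewed = 72 + (72 / 32) * 8 = 88 = 2 * 40 + 8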

    const uint32_t n_tensor_ops = in_width / 16;

#pragma unroll
    for (int l = 0; l < N_ITERS; ++l) {
        // Load chunk of inputs into shmem.
        // This is faster than loading it from gmem directly, even though it is
        // only used once. (Possibly due to latency hiding through staging.)
        const uint32_t n_elems_a = BLOCK_DIM_Z * 16 * in_width;

#pragma unroll
        for (uint32_t idx = thread_elem_idx; idx < n_elems_a;
             idx += n_elems_per_load) {
            const uint32_t idx_skewed = idx + idx / in_width * INPUT_SKEW;
            *(int4 *)&act_shmem[idx_skewed] =
                    *(int4 *)&input_threadblock[l * n_elems_a + idx];
        }

        __syncthreads();

        wmma::fill_fragment(result_frag[l], 0.0f);
#pragma unroll
        for (uint32_t i = 0; i < n_tensor_ops; ++i) {
            // Load chunk of inputs and weights from shared memory and multiply
            // them
            wmma::load_matrix_sync(
                    act_frag,
                    act_shmem + 16 * i +
                            (16 * threadIdx.z) * (in_width + INPUT_SKEW),
                    in_width + INPUT_SKEW);
            wmma::load_matrix_sync(
                    weights_frag,
                    weights_shmem + 16 * i +
                            weights_col * (in_width + INPUT_SKEW),
                    in_width + INPUT_SKEW);
            wmma::mma_sync(result_frag[l], act_frag, weights_frag,
                           result_frag[l]);
        }

        __syncthreads();

        warp_activation<__half>(activation, result_frag[l], result_frag[l]);
    }

#pragma unroll
    for (int l = 0; l < N_ITERS; ++l) {
        wmma::store_matrix_sync(
                act_shmem + weights_col +
                        (16 * (threadIdx.z + l * BLOCK_DIM_Z)) * (WIDTH + SKEW),
                result_frag[l], WIDTH + SKEW, wmma::mem_row_major);
    }

    if (out_intermediate_threadblock_this_layer != nullptr) {
        __syncthreads();

#pragma unroll
        for (int i = 0; i < N_ITERS; ++i) {
            *(int4 *)&out_intermediate_threadblock_this_layer
                    [lane_offset +
                     (row + 16 * (threadIdx.z + i * BLOCK_DIM_Z)) * WIDTH] =
                    *(int4 *)&act_shmem[lane_offset +
                                        (row +
                                         16 * (threadIdx.z + i * BLOCK_DIM_Z)) *
                                                (WIDTH + SKEW)];
        }
    }
}

template <int WIDTH, int BLOCK_DIM_Z, int N_ITERS, typename OUT_T>
__device__ void threadblock_last_layer_forward(
        Activation activation,
        __half *__restrict__ act_shmem,
        const __half *__restrict__ weights_this_layer,
        OUT_T *__restrict__ out,
        const uint32_t batch_size,
        const nvcuda::wmma::layout_t output_layout) {
    // act_shmem contains the intermediate activations (shared memory) of the
    // thread block's chunk of the batch.
    // weights_this_layer points to the weight matrix of the current layer.
    // out points to the location where the result produced by the thread
    // block should be written. Can be nullptr if nothing should be written.

    constexpr uint32_t SKEW = WIDTH % 16 == 0 ? 8 : 0;
    constexpr uint32_t N_BLOCKS = WIDTH / 16;

    using namespace nvcuda;

    // Fragments
    wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major>
            act_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::col_major>
            weights_frag[N_BLOCKS];
    wmma::fragment<wmma::accumulator, 16, 16, 16, OUT_T> result_frag;

    // Indices
    const uint32_t li = threadIdx.x;  // index in warp ("lane index")
    const uint32_t wi = threadIdx.y;  // index in block ("warp index")

    __half *__restrict__ weights_shmem =
            act_shmem + N_ITERS * BLOCK_DIM_Z * 16 * (WIDTH + SKEW);

    const uint32_t weights_row = (8 * li) % WIDTH;
    const uint32_t weights_col = (8 * li + 8 * 32 * wi) / WIDTH;

    // Load weight matrix into shared memory for the last multiplication.
    // Loading into shared memory as opposed to directly into registers is
    // faster because unlike in the previous layers, each warp uses the same
    // entries of the weight matrix.
    if (threadIdx.z == 0) {
        *(int4 *)&weights_shmem[weights_row + weights_col * (WIDTH + SKEW)] =
                *(int4 *)&weights_this_layer[weights_row + weights_col * WIDTH];
        // printf("[last forward] base=%d, shmem=%d, weight=%d\n", N_ITERS *
        // BLOCK_DIM_Z * 16 * (WIDTH + SKEW), weights_row + weights_col * (WIDTH
        // + SKEW), weights_row + weights_col * WIDTH);
    }

    __syncthreads();

#pragma unroll
    for (uint32_t i = 0; i < N_BLOCKS; ++i)
        wmma::load_matrix_sync(weights_frag[i], weights_shmem + 16 * i,
                               WIDTH + SKEW);

    // Perform last layer by parallelizing over iters
    for (uint32_t idx = wi; idx < N_ITERS; idx += N_BLOCKS) {
        wmma::fill_fragment(result_frag, 0.0f);

#pragma unroll
        for (uint32_t i = 0; i < N_BLOCKS; ++i) {
            // Load a chunk of intermediate activations from shared memory and
            // multiply with chunk of the weight matrix
            wmma::load_matrix_sync(
                    act_frag,
                    act_shmem + 16 * i +
                            (16 * (threadIdx.z + idx * BLOCK_DIM_Z)) *
                                    (WIDTH + SKEW),
                    WIDTH + SKEW);
            wmma::mma_sync(result_frag, act_frag, weights_frag[i], result_frag);
        }

        warp_activation<__half>(activation, result_frag, result_frag);

        if (output_layout == wmma::mem_row_major) {
            wmma::store_matrix_sync(
                    out + (threadIdx.z + idx * BLOCK_DIM_Z) * 16 * 16,
                    result_frag, 16, output_layout);
            // printf("[last forward] RM write out %d, batch %d\n", (threadIdx.z
            // + idx
            // * BLOCK_DIM_Z) * 16 * 16, 16);
        } else {
            wmma::store_matrix_sync(
                    out + (threadIdx.z + idx * BLOCK_DIM_Z) * 16, result_frag,
                    batch_size, output_layout);
            // printf("[last forward] CM write out %d, batch %d\n", (threadIdx.z
            // + idx
            // * BLOCK_DIM_Z) * 16, batch_size);
        }
    }
}

template <int WIDTH, int BLOCK_DIM_Z, int N_ITERS>
__device__ void threadblock_write_output_static(
        const __half *__restrict__ act_shmem,
        __half *__restrict__ output_threadblock) {
    // output_threadblock is filled from the thread block's act_shmem

    constexpr uint32_t SKEW = WIDTH % 16 == 0 ? 8 : 0;

    // Indices
    const uint32_t li = threadIdx.x;  // index in warp ("lane index")
    const uint32_t wi = threadIdx.y;  // index in block ("warp index")

    const uint32_t lane_offset = (8 * li) % WIDTH;
    const uint32_t row = (8 * li + wi * 8 * 32) / WIDTH;

    __syncthreads();

#pragma unroll
    for (int i = 0; i < N_ITERS; ++i) {
        *(int4 *)&output_threadblock[lane_offset +
                                     (row +
                                      16 * (threadIdx.z + i * BLOCK_DIM_Z)) *
                                             WIDTH] =
                *(int4 *)&act_shmem[lane_offset +
                                    (row +
                                     16 * (threadIdx.z + i * BLOCK_DIM_Z)) *
                                            (WIDTH + SKEW)];
    }
}

///////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////
///////////////////////////////////////////////////////////////////////////////////////////////

template <int WIDTH,
          int BLOCK_DIM_Z,
          int N_ITERS,
          typename OUT_T,
          bool INFERENCE>
__global__ void kernel_mlp_fused(const Activation activation,
                                 const Activation output_activation,
                                 const __half *__restrict__ input,
                                 const __half *__restrict__ weights,
                                 OUT_T *__restrict__ out_intermediate,
                                 OUT_T *__restrict__ out,
                                 const uint32_t batch_size,
                                 const uint32_t in_width,
                                 const uint32_t out_width,
                                 const uint32_t n_hidden_matmuls,
                                 const nvcuda::wmma::layout_t output_layout =
                                         nvcuda::wmma::mem_row_major) {
    // `input` points to the input matrix. Can be any width.
    // `weights` points to the weight matrices (contiguous in memory).
    // `out_intermediate` points to the memory where intermediate activations
    // should be written. When performing inference, a value of nullptr is
    // expected (intermediate results are not written).
    // `out` points to the memory where the network output should be written.
    // (Output width is assumed to be 16 neurons.)

    // if (threadIdx.x == 0) printf("[forward] call kernel_mlp_fused\n");
    // if (threadIdx.x == 0) printf("[forward] inputs=%f\n", (float)input[0]);
    // if (threadIdx.x == 0) printf("[forward] weights=%f\n",
    // (float)weights[0]);

    // if (threadIdx.x == 0) printf("[forward] forward_buffer=%f\n",
    // (float)out_intermediate[0]);

    // Shared memory contains the intermediate activations of blockDim.y*16
    // elements. In some cases, it also contains the weight matrix for the first
    // and last layer.
    extern __shared__ __half shmem[];
    __half *act_shmem = shmem;

    // Each block processes 16 * N_ITERS * BLOCK_DIM_Z batch elements;
    // elem_idx is the index of the first element handled by this block.
    const uint32_t elem_idx = 16 * blockIdx.x * N_ITERS * BLOCK_DIM_Z;

    // First layer
    if (in_width == WIDTH) {
        // If the input has the same width as the network, we can simply use the
        // network's regular layer routine (with static size) instead of using
        // the slower dynamic input layer routine.
        threadblock_load_input_static<WIDTH, BLOCK_DIM_Z, N_ITERS>(
                act_shmem, input + elem_idx * WIDTH);
        threadblock_layer<WIDTH, BLOCK_DIM_Z, N_ITERS, OUT_T>(
                activation, act_shmem, weights,
                !INFERENCE ? (out_intermediate + elem_idx * WIDTH) : nullptr);
    } else {
        threadblock_input_layer_forward_dynamic<WIDTH, BLOCK_DIM_Z, N_ITERS,
                                                OUT_T>(
                activation, act_shmem, input + elem_idx * in_width, weights,
                !INFERENCE ? (out_intermediate + elem_idx * WIDTH) : nullptr,
                in_width);
    }

    // if (threadIdx.x == 0) printf("[forward] kernel_mlp_fused: passed first
    // layer\n");
    // if (threadIdx.x == 0) printf("[forward] forward_buffer=%f\n",
    // (float)out_intermediate[0]);

    const uint32_t first_layer_size = WIDTH * in_width;
    const uint32_t layer_stride = WIDTH * WIDTH;
    const uint32_t output_stride = WIDTH * batch_size;

    // Hidden layers
    for (uint32_t k = 0; k < n_hidden_matmuls; ++k) {
        threadblock_layer<WIDTH, BLOCK_DIM_Z, N_ITERS, OUT_T>(
                activation, act_shmem,
                weights + first_layer_size + layer_stride * k,
                !INFERENCE ? (out_intermediate + output_stride * (k + 1) +
                              elem_idx * WIDTH)
                           : nullptr);
        // if (threadIdx.x == 0) printf("[forward] kernel_mlp_fused: passed %d
        // layer\n", k + 1);
        // if (threadIdx.x == 0) printf("[forward] forward_buffer=%f\n",
        // (float)out_intermediate[0]);
    }

    if (out_width > 16) {
        // In the forward pass, intermediate activations are already written
        // out.
        if (INFERENCE) {
            threadblock_write_output_static<WIDTH, BLOCK_DIM_Z, N_ITERS>(
                    act_shmem, out_intermediate + elem_idx * WIDTH);
        }
    } else if (out) {
        // Last layer
        if (output_layout == nvcuda::wmma::mem_row_major) {
            // printf("[last layer] RM write to out %d\n", elem_idx * 16);
            // if (threadIdx.x == 0) printf("[forward] forward_buffer=%f\n",
            // (float)out_intermediate[0]);
            threadblock_last_layer_forward<WIDTH, BLOCK_DIM_Z, N_ITERS, OUT_T>(
                    output_activation, act_shmem,
                    weights + first_layer_size +
                            layer_stride * n_hidden_matmuls,
                    out + elem_idx * 16, 16, output_layout);
            // if (threadIdx.x == 0) printf("[forward] forward_buffer=%f\n",
            // (float)out_intermediate[0]);
        } else {
            // printf("[last layer] CM write to out %d\n", elem_idx);
            // if (threadIdx.x == 0) printf("[forward] forward_buffer=%f\n",
            // (float)out_intermediate[0]);
            threadblock_last_layer_forward<WIDTH, BLOCK_DIM_Z, N_ITERS, OUT_T>(
                    output_activation, act_shmem,
                    weights + first_layer_size +
                            layer_stride * n_hidden_matmuls,
                    out + elem_idx, batch_size, output_layout);
            // if (threadIdx.x == 0) printf("[forward] forward_buffer=%f\n",
            // (float)out_intermediate[0]);
        }
    }
}

template <int WIDTH, int BLOCK_DIM_Z, int N_ITERS, typename OUTPUT_LAYOUT>
__global__ void kernel_mlp_fused_backward(
        const Activation activation,
        const __half *__restrict__ dL_doutput,
        const __half *__restrict__ weights,
        __half *__restrict__ out_intermediate,
        const __half *__restrict__ forward,
        __half *__restrict__ dL_dinput,
        const __half *__restrict__ weights_first_layer,
        const uint32_t batch_size,
        const uint32_t out_width,
        const uint32_t n_hidden_matmuls) {
    // `dL_doutput` points to the input matrix of the backward pass, i.e. the
    // loss gradients. Assumed to be 16 neurons wide.
    // `weights` points to the weight matrices (contiguous in memory).
    // `out_intermediate` points to the memory where backpropagated activation
    // gradients should be written.
    // `forward` points to the memory where the intermediate activations of the
    // forward pass are located (needed for activation backprop).

    constexpr uint32_t SKEW = WIDTH % 16 == 0 ? 8 : 0;

    // Indices
    const uint32_t li = threadIdx.x;  // index in warp ("lane index")
    const uint32_t wi = threadIdx.y;  // index in block ("warp index")
    const uint32_t bi = blockIdx.x;   // block index

    // Shared memory contains the intermediate activations of blockDim.y*16
    // elements. A skew is applied to the matrix storage to avoid bank
    // conflicts.
    extern __shared__ __half shmem[];
    __half *act_shmem = shmem;

    const uint32_t lane_offset = (8 * li) % WIDTH;
    const uint32_t row = (8 * li + wi * 8 * 32) / WIDTH;

    // Multiplying one 16-row chunk of intermediate activations with the weight
    // matrix requires all warps of the block. Thus, each block computes exactly
    // one 16-row chunk of the next layer's intermediate activations.
    const uint32_t elem_idx_base = 16 * bi * N_ITERS * BLOCK_DIM_Z;
    const uint32_t elem_idx = elem_idx_base + 16 * threadIdx.z;

    const uint32_t layer_stride = WIDTH * WIDTH;
    const uint32_t output_stride = WIDTH * batch_size;

    // Backprop through last layer
    if (out_width <= 16) {
        using namespace nvcuda;

        // Fragments in registers
        wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, OUTPUT_LAYOUT>
                act_frag;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, __half, wmma::row_major>
                weights_frag;
        wmma::fragment<wmma::accumulator, 16, 16, 16, __half>
                result_frag[N_ITERS];

        // Load the relevant chunk of the last layer's weight matrix from global
        // memory into registers
        const uint32_t weights_col = 16 * wi;

        wmma::load_matrix_sync(
                weights_frag,
                weights + layer_stride * n_hidden_matmuls + weights_col, WIDTH);

#pragma unroll
        for (int l = 0; l < N_ITERS; ++l) {
            wmma::fill_fragment(result_frag[l], 0.0f);

            // Load a chunk of output gradients from shared memory and multiply
            // with previously loaded weights
            if (std::is_same<OUTPUT_LAYOUT, wmma::row_major>::value) {
                wmma::load_matrix_sync(
                        act_frag,
                        dL_doutput + (elem_idx +
                                      16 * (threadIdx.z + l * BLOCK_DIM_Z)) *
                                             16,
                        16);
            } else {
                wmma::load_matrix_sync(
                        act_frag,
                        dL_doutput + (elem_idx +
                                      16 * (threadIdx.z + l * BLOCK_DIM_Z)),
                        batch_size);
            }

            // NOTE: activation transfer of the _output_ activation is expected
            // to be done _prior_ to calling this kernel in a separate pass,
            // because the transferred activation gradient is also needed to
            // compute the weight gradient of the last weight matrix (see
            // backward()).
            wmma::mma_sync(result_frag[l], act_frag, weights_frag,
                           result_frag[l]);

            // Load the temporary forward matrix for the relu transfer
            wmma::fragment<wmma::matrix_a, 16, 16, 16, __half, wmma::row_major>
                    forward_frag;
            wmma::load_matrix_sync(
                    forward_frag,
                    forward + output_stride * n_hidden_matmuls + weights_col +
                            (elem_idx + l * BLOCK_DIM_Z * 16) * WIDTH,
                    WIDTH);

            warp_activation_backward<__half>(activation, result_frag[l],
                                             forward_frag, result_frag[l]);
        }

        __syncthreads();

#pragma unroll
        for (int l = 0; l < N_ITERS; ++l) {
            wmma::store_matrix_sync(
                    act_shmem + weights_col +
                            (16 * (threadIdx.z + l * BLOCK_DIM_Z)) *
                                    (WIDTH + SKEW),
                    result_frag[l], WIDTH + SKEW, wmma::mem_row_major);
        }

        __syncthreads();

#pragma unroll
        for (int i = 0; i < N_ITERS; ++i) {
            *(int4 *)&out_intermediate[lane_offset +
                                       (row + elem_idx + i * BLOCK_DIM_Z * 16) *
                                               WIDTH] =
                    *(int4 *)&act_shmem[lane_offset +
                                        (row +
                                         16 * (threadIdx.z + i * BLOCK_DIM_Z)) *
                                                (WIDTH + SKEW)];
        }
    } else {
        // If the output width is larger than 16, we will have used CUTLASS for
        // backpropping through the last layer. Load the resulting gradients.
        threadblock_load_input_static<WIDTH, BLOCK_DIM_Z, N_ITERS>(
                act_shmem, out_intermediate + elem_idx * WIDTH);
    }

    // Backprop through hidden layers
    for (uint32_t k = 0; k < n_hidden_matmuls; ++k) {
        threadblock_layer<WIDTH, BLOCK_DIM_Z, N_ITERS, __half, true>(
                activation, act_shmem,
                weights + layer_stride * (n_hidden_matmuls - k - 1),
                out_intermediate + output_stride * (k + 1) +
                        elem_idx_base * WIDTH,
                forward + output_stride * (n_hidden_matmuls - k - 1) +
                        elem_idx_base * WIDTH);
    }

    // Compute loss gradients w.r.t. input if desired.
    // THIS CODE ASSUMES THAT THE INPUT WIDTH IS THE SAME AS THE NETWORK WIDTH.
    // DON'T PASS A NON-NULL dL_dinput IF THIS REQUIREMENT IS NOT MET.
    if (dL_dinput != nullptr) {
        threadblock_layer<WIDTH, BLOCK_DIM_Z, N_ITERS, __half, true>(
                Activation::None, act_shmem, weights_first_layer,
                dL_dinput + elem_idx_base * WIDTH);
    }
}

//////////////////////////////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////////////////////////
//////////////////////////////////////////////////////////////////////////////////////////////

template <uint32_t WIDTH, bool INFERENCE>  // WIDTH is hidden_dim
void ffmlp_forward_cuda(const __half *inputs,
                        const __half *weights,
                        const uint32_t B,
                        const uint32_t input_dim,
                        const uint32_t output_dim,
                        const uint32_t num_layers,
                        const Activation activation,
                        const Activation output_activation,
                        __half *forward_buffer,
                        __half *outputs) {
    constexpr uint32_t SKEW =
            WIDTH % 16 == 0 ? 8 : 0;  // <- always going to be 8 as we only
                                      // support multiple-of-16 widths
    constexpr uint32_t INPUT_SKEW = 8;  // <- likewise with inputs
    constexpr uint32_t N_BLOCK_ROWS = WIDTH / 16;

    const int N_ITERS = WIDTH >= 256 ? 2 : 8;
    const uint32_t BLOCK_DIM_Z = (INFERENCE && WIDTH == 128) ? 2 : 1;

    const dim3 threads = {
            32u, N_BLOCK_ROWS,
            BLOCK_DIM_Z};  // 32 threads = 1 warp, N_BLOCK_ROWS warps
                           // per block for 16 rows, up to 2x 8 warps
                           // can share input (does not help vs. 1)

    uint32_t n_elems_per_block = 16 * BLOCK_DIM_Z * N_ITERS;
    uint32_t n_blocks = div_round_up(B, n_elems_per_block);

    size_t shmem_size =
            sizeof(__half) * (16 + 16 * BLOCK_DIM_Z * N_ITERS) *
            (WIDTH +
             SKEW);  // 16 rows of weights (for the last layer; others are in
                     // registers only) + 16*BLOCK_DIM_Z*N_ITERS rows of
                     // intermediate activations, each WIDTH+SKEW halfs wide

    // If the input width is dynamic, the input weight matrix as well as part of
    // the input will live in extra shared memory
    if (input_dim != WIDTH) {
        shmem_size = std::max(shmem_size, sizeof(__half) *
                                                  (WIDTH + 16 * BLOCK_DIM_Z) *
                                                  (input_dim + INPUT_SKEW));
    }

    // printf("[ffmlp_forward_cuda] shmem size = %d\n", shmem_size);

    const dim3 blocks = {n_blocks, 1u, 1u};

    check_shmem_error(cudaFuncSetAttribute(
            kernel_mlp_fused<WIDTH, BLOCK_DIM_Z, N_ITERS, __half, INFERENCE>,
            cudaFuncAttributeMaxDynamicSharedMemorySize, (int)shmem_size));

    kernel_mlp_fused<WIDTH, BLOCK_DIM_Z, N_ITERS, __half, INFERENCE>
            <<<blocks, threads, shmem_size, 0>>>(
                    activation, output_activation,
                    inputs,          // CM
                    weights,         // RM
                    forward_buffer,  // CM
                    outputs,         // CM
                    B, input_dim, output_dim, num_layers - 1,
                    nvcuda::wmma::mem_row_major  // reversed outputs' layout
            );
}
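
// ---------------------------------------------------------------------------
// Illustrative sketch, not part of the original kernels: the launch geometry
// computed in ffmlp_forward_cuda above, restated as host-side constexpr
// helpers so the arithmetic can be checked at compile time. The namespace and
// all function names are hypothetical. Assumptions: sizeof(__half) == 2, and
// SKEW == INPUT_SKEW == 8 as in the constants above.
// ---------------------------------------------------------------------------
#include <cstddef>
#include <cstdint>

namespace ffmlp_sketch_launch {

// Batch elements processed by one thread block.
constexpr std::uint32_t elems_per_block(std::uint32_t block_dim_z,
                                        std::uint32_t n_iters) {
    return 16u * block_dim_z * n_iters;
}

// Blocks needed to cover a batch of B elements (rounded up).
constexpr std::uint32_t num_blocks(std::uint32_t B,
                                   std::uint32_t block_dim_z,
                                   std::uint32_t n_iters) {
    return (B + elems_per_block(block_dim_z, n_iters) - 1u) /
           elems_per_block(block_dim_z, n_iters);
}

// Bytes for 16 rows of last-layer weights plus the activation rows,
// each row being WIDTH + SKEW halfs wide.
constexpr std::size_t static_shmem_bytes(std::uint32_t width,
                                         std::uint32_t block_dim_z,
                                         std::uint32_t n_iters) {
    return std::size_t(2) * (16u + 16u * block_dim_z * n_iters) *
           (width + 8u);
}

// Bytes needed by the dynamic input layer (input_dim != WIDTH).
constexpr std::size_t dynamic_shmem_bytes(std::uint32_t width,
                                          std::uint32_t block_dim_z,
                                          std::uint32_t input_dim) {
    return std::size_t(2) * (width + 16u * block_dim_z) * (input_dim + 8u);
}

// Shared memory requested at launch: the larger of the two when the input
// width differs from the network width, the static size otherwise.
constexpr std::size_t forward_shmem_bytes(std::uint32_t width,
                                          std::uint32_t block_dim_z,
                                          std::uint32_t n_iters,
                                          std::uint32_t input_dim) {
    return (input_dim != width &&
            dynamic_shmem_bytes(width, block_dim_z, input_dim) >
                    static_shmem_bytes(width, block_dim_z, n_iters))
                   ? dynamic_shmem_bytes(width, block_dim_z, input_dim)
                   : static_shmem_bytes(width, block_dim_z, n_iters);
}

// Worked example: WIDTH = 64 gives N_ITERS = 8 and BLOCK_DIM_Z = 1.
static_assert(elems_per_block(1u, 8u) == 128u, "16 * 1 * 8");
static_assert(num_blocks(4096u, 1u, 8u) == 32u, "4096 / 128");
static_assert(num_blocks(4097u, 1u, 8u) == 33u, "rounds up");
static_assert(forward_shmem_bytes(64u, 1u, 8u, 64u) == 20736u,
              "2 * (16 + 128) * 72");
static_assert(forward_shmem_bytes(64u, 1u, 8u, 32u) == 20736u,
              "dynamic part (2 * 80 * 40 = 6400) is smaller");

}  // namespace ffmlp_sketch_launch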

template <uint32_t WIDTH>  // WIDTH is hidden_dim
void ffmlp_backward_cuda(const __half *grad,
                         const __half *weights,
                         const uint32_t B,
                         const uint32_t input_dim,
                         const uint32_t output_dim,
                         const uint32_t num_layers,
                         const Activation activation,
                         const __half *forward_buffer,
                         __half *backward_buffer,
                         __half *grad_inputs) {
    // locate
    const __half *weights_first = weights;
    const __half *weights_second = weights + input_dim * WIDTH;

    constexpr uint32_t SKEW =
            WIDTH % 16 == 0 ? 8 : 0;  // <- always going to be 8 as we only
                                      // support multiple-of-16 widths
    constexpr uint32_t N_BLOCKS = WIDTH / 16;

    const int N_ITERS = WIDTH >= 256 ? 2 : 8;
    const uint32_t BLOCK_DIM_Z = 1;

    const dim3 threads = {
            32u, N_BLOCKS,
            BLOCK_DIM_Z};  // 32 threads = 1 warp, N_BLOCKS warps
                           // per block for 16 rows, up to 2x 8 warps
                           // can share input (does not help vs. 1)

    uint32_t n_elems_per_block = 16 * BLOCK_DIM_Z * N_ITERS;
    uint32_t n_blocks = div_round_up(B, n_elems_per_block);

    size_t shmem_size =
            sizeof(__half) *
            ((16 * BLOCK_DIM_Z * N_ITERS) *
             (WIDTH +
              SKEW));  // 16*BLOCK_DIM_Z*N_ITERS rows of activation gradients,
                       // each WIDTH+SKEW halfs wide

    const dim3 blocks = {n_blocks, 1u, 1u};

    // The kernels operate with transposed layouts compared with the MLP code
    check_shmem_error(cudaFuncSetAttribute(
            kernel_mlp_fused_backward<WIDTH, BLOCK_DIM_Z, N_ITERS,
                                      nvcuda::wmma::row_major>,
            cudaFuncAttributeMaxDynamicSharedMemorySize, (int)shmem_size));

    kernel_mlp_fused_backward<WIDTH, BLOCK_DIM_Z, N_ITERS,
                              nvcuda::wmma::row_major>
            <<<blocks, threads, shmem_size, 0>>>(activation,
                                                 grad,             // CM
                                                 weights_second,   // RM
                                                 backward_buffer,  // CM
                                                 forward_buffer,   // CM
                                                 grad_inputs,      // CM
                                                 weights_first,    // RM
                                                 B, output_dim, num_layers - 1);
}

// inputs: col-major [input_dim, B]
// weights: row-major [hidden_dim * input_dim] +
//          [hidden_dim * hidden_dim * (num_layers - 1)] +
//          [output_dim * hidden_dim]
// forward_buffer: col-major [num_layers, hidden_dim, B]
// outputs: col-major [output_dim, B]
void ffmlp_forward(const at::Tensor inputs,
                   const at::Tensor weights,
                   const uint32_t B,
                   const uint32_t input_dim,
                   const uint32_t output_dim,
                   const uint32_t hidden_dim,
                   const uint32_t num_layers,
                   const uint32_t activation_,
                   const uint32_t output_activation_,
                   at::Tensor forward_buffer,
                   at::Tensor outputs) {
    CHECK_CUDA(inputs);
    CHECK_CONTIGUOUS(inputs);
    CHECK_IS_HALF(inputs);

    CHECK_CUDA(weights);
    CHECK_CONTIGUOUS(weights);
    CHECK_IS_HALF(weights);

    Activation activation = convert_activation(activation_);
    Activation output_activation = convert_activation(output_activation_);

    auto inputs_ptr = reinterpret_cast<__half *>(inputs.data_ptr<at::Half>());
    auto weights_ptr = reinterpret_cast<__half *>(weights.data_ptr<at::Half>());
    auto forward_buffer_ptr =
            reinterpret_cast<__half *>(forward_buffer.data_ptr<at::Half>());
    auto outputs_ptr = reinterpret_cast<__half *>(outputs.data_ptr<at::Half>());

    switch (hidden_dim) {
        case 16:
            ffmlp_forward_cuda<16, false>(inputs_ptr, weights_ptr, B, input_dim,
                                          output_dim, num_layers, activation,
                                          output_activation, forward_buffer_ptr,
                                          outputs_ptr);
            break;
        case 32:
            ffmlp_forward_cuda<32, false>(inputs_ptr, weights_ptr, B, input_dim,
                                          output_dim, num_layers, activation,
                                          output_activation, forward_buffer_ptr,
                                          outputs_ptr);
            break;
        case 64:
            ffmlp_forward_cuda<64, false>(inputs_ptr, weights_ptr, B, input_dim,
                                          output_dim, num_layers, activation,
                                          output_activation, forward_buffer_ptr,
                                          outputs_ptr);
            break;
        case 128:
            ffmlp_forward_cuda<128, false>(inputs_ptr, weights_ptr, B,
                                           input_dim, output_dim, num_layers,
                                           activation, output_activation,
                                           forward_buffer_ptr, outputs_ptr);
            break;
        case 256:
            ffmlp_forward_cuda<256, false>(inputs_ptr, weights_ptr, B,
                                           input_dim, output_dim, num_layers,
                                           activation, output_activation,
                                           forward_buffer_ptr, outputs_ptr);
            break;
        default:
            throw std::runtime_error{
                    "hidden_dim should be in [16, 32, 64, 128, 256]"};
    }

    // for output_dim > 16
    if (output_dim > 16) {
        fc_multiply<LastLayer, true, false, false>(
                0, output_dim, hidden_dim, B,
                (weights_ptr + hidden_dim * input_dim +
                 (num_layers - 1) * hidden_dim *
                         hidden_dim),  // row-major, [output_dim, hidden_dim]
                (forward_buffer_ptr + (num_layers - 1) * hidden_dim *
                                              B),  // col-major [hidden_dim, B]
                outputs_ptr,                       // col-major [output_dim, B]
                output_activation);
    }
}
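
// ---------------------------------------------------------------------------
// Illustrative sketch, not part of the original code: element offsets into
// the packed `weights` tensor, following the layout documented above
// ffmlp_forward ([hidden_dim * input_dim] + [hidden_dim * hidden_dim *
// (num_layers - 1)] + [output_dim * hidden_dim]). The namespace and function
// names are hypothetical; offsets are counted in __half elements, not bytes.
// ---------------------------------------------------------------------------
#include <cstdint>

namespace ffmlp_sketch_weights {

// Offset of the k-th hidden weight matrix (k = 0 is the first matrix after
// the input layer).
constexpr std::uint32_t hidden_offset(std::uint32_t input_dim,
                                      std::uint32_t hidden_dim,
                                      std::uint32_t k) {
    return hidden_dim * input_dim + k * hidden_dim * hidden_dim;
}

// Offset of the last (output) weight matrix; this is the same expression
// that ffmlp_forward passes to fc_multiply when output_dim > 16.
constexpr std::uint32_t last_offset(std::uint32_t input_dim,
                                    std::uint32_t hidden_dim,
                                    std::uint32_t num_layers) {
    return hidden_offset(input_dim, hidden_dim, num_layers - 1u);
}

// Total number of weight elements in the packed tensor.
constexpr std::uint32_t total_weights(std::uint32_t input_dim,
                                      std::uint32_t hidden_dim,
                                      std::uint32_t output_dim,
                                      std::uint32_t num_layers) {
    return last_offset(input_dim, hidden_dim, num_layers) +
           output_dim * hidden_dim;
}

// Worked example: input_dim = 32, hidden_dim = 64, output_dim = 16,
// num_layers = 3.
static_assert(hidden_offset(32u, 64u, 0u) == 2048u, "64 * 32");
static_assert(hidden_offset(32u, 64u, 1u) == 6144u, "2048 + 64 * 64");
static_assert(last_offset(32u, 64u, 3u) == 10240u, "2048 + 2 * 4096");
static_assert(total_weights(32u, 64u, 16u, 3u) == 11264u, "10240 + 16 * 64");

}  // namespace ffmlp_sketch_weights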

void ffmlp_inference(const at::Tensor inputs,
                     const at::Tensor weights,
                     const uint32_t B,
                     const uint32_t input_dim,
                     const uint32_t output_dim,
                     const uint32_t hidden_dim,
                     const uint32_t num_layers,
                     const uint32_t activation_,
                     const uint32_t output_activation_,
                     at::Tensor inference_buffer,
                     at::Tensor outputs) {
    CHECK_CUDA(inputs);
    CHECK_CONTIGUOUS(inputs);
    CHECK_IS_HALF(inputs);

    CHECK_CUDA(weights);
    CHECK_CONTIGUOUS(weights);
    CHECK_IS_HALF(weights);

    Activation activation = convert_activation(activation_);
    Activation output_activation = convert_activation(output_activation_);

    auto inputs_ptr = reinterpret_cast<__half *>(inputs.data_ptr<at::Half>());
    auto weights_ptr = reinterpret_cast<__half *>(weights.data_ptr<at::Half>());
    auto inference_buffer_ptr =
            reinterpret_cast<__half *>(inference_buffer.data_ptr<at::Half>());
    auto outputs_ptr = reinterpret_cast<__half *>(outputs.data_ptr<at::Half>());

    switch (hidden_dim) {
        case 16:
            ffmlp_forward_cuda<16, true>(inputs_ptr, weights_ptr, B, input_dim,
                                         output_dim, num_layers, activation,
                                         output_activation,
                                         inference_buffer_ptr, outputs_ptr);
            break;
        case 32:
            ffmlp_forward_cuda<32, true>(inputs_ptr, weights_ptr, B, input_dim,
                                         output_dim, num_layers, activation,
                                         output_activation,
                                         inference_buffer_ptr, outputs_ptr);
            break;
        case 64:
            ffmlp_forward_cuda<64, true>(inputs_ptr, weights_ptr, B, input_dim,
                                         output_dim, num_layers, activation,
                                         output_activation,
                                         inference_buffer_ptr, outputs_ptr);
            break;
        case 128:
            ffmlp_forward_cuda<128, true>(inputs_ptr, weights_ptr, B, input_dim,
                                          output_dim, num_layers, activation,
                                          output_activation,
                                          inference_buffer_ptr, outputs_ptr);
            break;
        case 256:
            ffmlp_forward_cuda<256, true>(inputs_ptr, weights_ptr, B, input_dim,
                                          output_dim, num_layers, activation,
                                          output_activation,
                                          inference_buffer_ptr, outputs_ptr);
            break;
        default:
            throw std::runtime_error{
                    "hidden_dim should be in [16, 32, 64, 128, 256]"};
    }

    // for output_dim > 16
    if (output_dim > 16) {
        fc_multiply<LastLayer, true, false, false>(
                0, output_dim, hidden_dim, B,
                (weights_ptr + hidden_dim * input_dim +
                 (num_layers - 1) * hidden_dim *
                         hidden_dim),  // row-major, [output_dim, hidden_dim]
                inference_buffer_ptr,  // col-major [hidden_dim, B]
                outputs_ptr,           // col-major [output_dim, B]
                output_activation);
    }
}

inline std::vector<cudaStream_t> &streams_splitk() {
    static std::vector<cudaStream_t> res;
    return res;
}

inline std::vector<cudaEvent_t> &events_splitk() {
    static std::vector<cudaEvent_t> res;
    return res;
}

void allocate_splitk(size_t size) {
    auto &streams = streams_splitk();
    auto &events = events_splitk();
    streams.resize(size);
    events.resize(size);
    for (size_t i = 0; i < size; i++) {
        CUDA_CHECK_THROW(cudaStreamCreate(&streams[i]));
        CUDA_CHECK_THROW(cudaEventCreate(&events[i]));
    }
}

void free_splitk() {
    auto &streams = streams_splitk();
    auto &events = events_splitk();
    for (size_t i = 0; i < streams.size(); i++) {
        cutlass_free_workspace(streams[i]);
        CUDA_CHECK_PRINT(cudaStreamDestroy(streams[i]));
        CUDA_CHECK_PRINT(cudaEventDestroy(events[i]));
    }
}
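
// ---------------------------------------------------------------------------
// Illustrative sketch, not part of the original code: two quantities that
// ffmlp_backward in this file derives for its split-k weight-gradient
// multiplications. The namespace and function names are hypothetical.
// The stream count is inferred from the indices ffmlp_backward passes to
// streams_splitk().at(...) for those multiplications (one for the output
// layer, num_layers - 1 for the hidden layers, one for the input layer), so
// allocate_splitk presumably needs at least that many entries.
// ---------------------------------------------------------------------------
#include <cstdint>

namespace ffmlp_sketch_splitk {

// Mirrors `B / std::min((uint32_t)(1 << 12), B)`. Assumes B > 0, since the
// expression divides by zero for an empty batch.
constexpr std::uint32_t split_k_factor(std::uint32_t B) {
    return B / (B < 4096u ? B : 4096u);
}

// Minimum number of streams/events indexed by the weight-gradient passes.
constexpr std::uint32_t min_streams_needed(std::uint32_t num_layers) {
    return num_layers + 1u;
}

static_assert(split_k_factor(1000u) == 1u, "B below 4096 gives 1");
static_assert(split_k_factor(4096u) == 1u, "4096 / 4096");
static_assert(split_k_factor(8192u) == 2u, "8192 / 4096");
static_assert(split_k_factor(10000u) == 2u, "integer division");
static_assert(min_streams_needed(3u) == 4u, "indices 0..3");

}  // namespace ffmlp_sketch_splitk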

// grad: col-major [output_dim, B]
// inputs: col-major [input_dim, B]
// weights: row-major [hidden_dim * input_dim] +
//          [hidden_dim * hidden_dim * (num_layers - 1)] +
//          [output_dim * hidden_dim]
// forward_buffer: col-major [num_layers, hidden_dim, B]
// backward_buffer: col-major [num_layers, hidden_dim, B]
// grad_inputs: col-major [input_dim, B]
// grad_weights: row-major [hidden_dim * input_dim] +
//               [hidden_dim * hidden_dim * (num_layers - 1)] +
//               [output_dim * hidden_dim]
void ffmlp_backward(const at::Tensor grad,
                    const at::Tensor inputs,
                    const at::Tensor weights,
                    const at::Tensor forward_buffer,
                    const uint32_t B,
                    const uint32_t input_dim,
                    const uint32_t output_dim,
                    const uint32_t hidden_dim,
                    const uint32_t num_layers,
                    const uint32_t activation_,
                    const uint32_t output_activation_,
                    const bool calc_grad_inputs,
                    at::Tensor backward_buffer,
                    at::Tensor grad_inputs,
                    at::Tensor grad_weights) {
    CHECK_CUDA(grad);
    CHECK_CONTIGUOUS(grad);
    CHECK_IS_HALF(grad);

    CHECK_CUDA(inputs);
    CHECK_CONTIGUOUS(inputs);
    CHECK_IS_HALF(inputs);

    CHECK_CUDA(weights);
    CHECK_CONTIGUOUS(weights);
    CHECK_IS_HALF(weights);

    CHECK_CUDA(forward_buffer);
    CHECK_CONTIGUOUS(forward_buffer);
    CHECK_IS_HALF(forward_buffer);

    CHECK_CUDA(backward_buffer);
    CHECK_CONTIGUOUS(backward_buffer);
    CHECK_IS_HALF(backward_buffer);

    CHECK_CUDA(grad_weights);
    CHECK_CONTIGUOUS(grad_weights);
    CHECK_IS_HALF(grad_weights);

    CHECK_CUDA(grad_inputs);
    CHECK_CONTIGUOUS(grad_inputs);
    CHECK_IS_HALF(grad_inputs);

    Activation activation = convert_activation(activation_);
    Activation output_activation = convert_activation(output_activation_);

    // activation_backward_output_gpu (output_activation is going to be
    // discarded ...)

    int split_k_factor = B / std::min((uint32_t)(1 << 12), B);

    uint32_t forward_index = num_layers - 1;
    uint32_t backward_index = 0;

    auto backward_buffer_ptr =
            reinterpret_cast<__half *>(backward_buffer.data_ptr<at::Half>());
    auto forward_buffer_ptr =
            reinterpret_cast<__half *>(forward_buffer.data_ptr<at::Half>());
    auto grad_ptr = reinterpret_cast<__half *>(grad.data_ptr<at::Half>());
    auto inputs_ptr = reinterpret_cast<__half *>(inputs.data_ptr<at::Half>());
    auto weights_ptr = reinterpret_cast<__half *>(weights.data_ptr<at::Half>());
    auto grad_weights_ptr =
            reinterpret_cast<__half *>(grad_weights.data_ptr<at::Half>());

    auto grad_inputs_ptr = calc_grad_inputs
                                   ? reinterpret_cast<__half *>(
                                             grad_inputs.data_ptr<at::Half>())
                                   : nullptr;
    auto grad_inputs_fused_ptr =
            input_dim == hidden_dim ? grad_inputs_ptr : nullptr;

    // calc output layer, grad_weights
    cudaEventRecord(events_splitk().at(backward_index), 0);
    cudaStreamWaitEvent(streams_splitk().at(backward_index),
                        events_splitk().at(backward_index), 0);

    fc_multiply_split_k<LastLayerK, false, true, true>(
            streams_splitk().at(backward_index), output_dim, B, hidden_dim,
            grad_ptr,  // col-major, [output_dim, B]
            (forward_buffer_ptr +
             forward_index * hidden_dim * B),  // row-major, [B, hidden_dim]
            (grad_weights_ptr + hidden_dim * input_dim +
             (num_layers - 1) * hidden_dim *
                     hidden_dim),  // row-major, [output_dim, hidden_dim]
            split_k_factor);

    cudaEventRecord(events_splitk().at(backward_index),
                    streams_splitk().at(backward_index));

    // prepare the last backward_buffer if output_dim > 16
    if (output_dim > 16) {
        fc_multiply<FullLayer, false, false, false>(
                0, hidden_dim, output_dim, B,
                (grad_weights_ptr + hidden_dim * input_dim +
                 (num_layers - 1) * hidden_dim *
                         hidden_dim),  // col-major, [hidden_dim, output_dim]
                grad_ptr,              // col-major, [output_dim, B]
                (forward_buffer_ptr +
                 forward_index * hidden_dim * B),  // col-major, [hidden_dim, B]
                (backward_buffer_ptr +
                 backward_index * hidden_dim * B),  // col-major [hidden_dim, B]
                activation, true);
    }

    // prepare backward_buffer
    // calc grad_inputs if input_dim == hidden_dim
    switch (hidden_dim) {
        case 16:
            ffmlp_backward_cuda<16>(grad_ptr, weights_ptr, B, input_dim,
                                    output_dim, num_layers, activation,
                                    forward_buffer_ptr, backward_buffer_ptr,
                                    grad_inputs_fused_ptr);
            break;
        case 32:
            ffmlp_backward_cuda<32>(grad_ptr, weights_ptr, B, input_dim,
                                    output_dim, num_layers, activation,
                                    forward_buffer_ptr, backward_buffer_ptr,
                                    grad_inputs_fused_ptr);
            break;
        case 64:
            ffmlp_backward_cuda<64>(grad_ptr, weights_ptr, B, input_dim,
                                    output_dim, num_layers, activation,
                                    forward_buffer_ptr, backward_buffer_ptr,
                                    grad_inputs_fused_ptr);
            break;
        case 128:
            ffmlp_backward_cuda<128>(grad_ptr, weights_ptr, B, input_dim,
                                     output_dim, num_layers, activation,
                                     forward_buffer_ptr, backward_buffer_ptr,
                                     grad_inputs_fused_ptr);
            break;
        case 256:
            ffmlp_backward_cuda<256>(grad_ptr, weights_ptr, B, input_dim,
                                     output_dim, num_layers, activation,
                                     forward_buffer_ptr, backward_buffer_ptr,
                                     grad_inputs_fused_ptr);
            break;
        default:
            throw std::runtime_error{
                    "hidden_dim should be in [16, 32, 64, 128, 256]"};
    }

    // printf("[backward] finished backward kernel\n");

    forward_index--;
    backward_index++;

    // calc middle layer's grad_weights
    for (uint32_t i = 0; i < num_layers - 1; i++) {
        uint32_t matrix_index = num_layers - 2 - i;

        cudaEventRecord(events_splitk().at(backward_index), 0);
        cudaStreamWaitEvent(streams_splitk().at(backward_index),
                            events_splitk().at(backward_index), 0);

        fc_multiply_split_k<FullLayerK, false, true, true>(
                streams_splitk().at(backward_index), hidden_dim, B, hidden_dim,
                (backward_buffer_ptr + (backward_index - 1) * hidden_dim *
                                               B),  // col-major [hidden_dim, B]
                (forward_buffer_ptr +
                 forward_index * hidden_dim * B),  // row-major [B, hidden_dim]
                (grad_weights_ptr + hidden_dim * input_dim +
                 matrix_index * hidden_dim *
                         hidden_dim),  // row-major, [hidden_dim, hidden_dim]
                split_k_factor);

        cudaEventRecord(events_splitk().at(backward_index),
                        streams_splitk().at(backward_index));

        forward_index--;
        backward_index++;
    }

    // calc input layer's grad_weights
    cudaEventRecord(events_splitk().at(backward_index), 0);
    cudaStreamWaitEvent(streams_splitk().at(backward_index),
                        events_splitk().at(backward_index), 0);

    fc_multiply_split_k<FullLayerK, false, true, true>(
            streams_splitk().at(backward_index), hidden_dim, B, input_dim,
            (backward_buffer_ptr + (backward_index - 1) * hidden_dim *
                                           B),  // col-major [hidden_dim, B]
            inputs_ptr,                         // row-major, [B, input_dim]
            grad_weights_ptr,  // row-major, [hidden_dim, input_dim]
            split_k_factor);

    cudaEventRecord(events_splitk().at(backward_index),
                    streams_splitk().at(backward_index));

    // calc grad_inputs if input_dim != hidden_dim
    if (calc_grad_inputs && grad_inputs_fused_ptr == nullptr) {
        fc_multiply<FullLayer, false, false, false>(
                0, input_dim, hidden_dim, B,
                weights_ptr,  // col-major [input_dim, hidden_dim]
                (backward_buffer_ptr + (backward_index - 1) * hidden_dim *
                                               B),  // col-major [hidden_dim, B]
                grad_inputs_ptr                     // col-major [input_dim, B]
        );
    }

    // All the per-layer split-k matrix multiplications summing over
    // the batch are computed in parallel streams to the actual
    // backpropagation. Here, we need to wait for all of these to complete.
    for (auto &event : events_splitk()) {
        cudaStreamWaitEvent(0, event, 0);
    }
}

================================================
FILE: lidarnerf/ffmlp/src/ffmlp.h
================================================
#pragma once

#include <stdint.h>
#include <torch/torch.h>

// activation: should have been enum, here we just use int.
void ffmlp_forward(const at::Tensor inputs,
                   const at::Tensor weights,
                   const uint32_t B,
                   const uint32_t input_dim,
                   const uint32_t output_dim,
                   const uint32_t hidden_dim,
                   const uint32_t num_layers,
                   const uint32_t activation_,
                   const uint32_t output_activation_,
                   at::Tensor forward_buffer,
                   at::Tensor outputs);
void ffmlp_inference(const at::Tensor inputs,
                     const at::Tensor weights,
                     const uint32_t B,
                     const uint32_t input_dim,
                     const uint32_t output_dim,
                     const uint32_t hidden_dim,
                     const uint32_t num_layers,
                     const uint32_t activation_,
                     const uint32_t output_activation_,
                     at::Tensor inference_buffer,
                     at::Tensor outputs);

void ffmlp_backward(const at::Tensor grad,
                    const at::Tensor inputs,
                    const at::Tensor weights,
                    const at::Tensor forward_buffer,
                    const uint32_t B,
                    const uint32_t input_dim,
                    const uint32_t output_dim,
                    const uint32_t hidden_dim,
                    const uint32_t num_layers,
                    const uint32_t activation,
                    const uint32_t output_activation,
                    const bool calc_grad_inputs,
                    at::Tensor backward_buffer,
                    at::Tensor grad_inputs,
                    at::Tensor grad_weights);

void allocate_splitk(size_t size);
void free_splitk();

================================================
FILE: lidarnerf/ffmlp/src/utils.h
================================================
#pragma once

#include <cuda.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>

#include <array>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <iostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

#define CHECK_CUDA(x) \
    TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) \
    TORCH_CHECK(x.is_contiguous(), #x " must be a contiguous tensor")
#define CHECK_IS_INT(x)                                 \
    TORCH_CHECK(x.scalar_type() == at::ScalarType::Int, \
                #x " must be an int tensor")
#define CHECK_IS_FLOATING(x)                                       \
    TORCH_CHECK(x.scalar_type() == at::ScalarType::Float ||        \
                        x.scalar_type() == at::ScalarType::Half || \
                        x.scalar_type() == at::ScalarType::Double, \
                #x " must be a floating tensor")
#define CHECK_IS_HALF(x)                                 \
    TORCH_CHECK(x.scalar_type() == at::ScalarType::Half, \
                #x " must be a Half tensor")

static constexpr uint32_t MIN_GPU_ARCH = 70;

using network_precision_t = __half;

enum class Activation {
    ReLU,
    Exponential,
    Sine,
    Sigmoid,
    Squareplus,
    Softplus,
    None,
};

static constexpr float PI = 3.14159265358979323846f;
static constexpr float SQRT2 = 1.41421356237309504880f;
static constexpr float K_ACT = 10.0f;

__host__ __device__ inline float logistic(const float x) {
    return 1.0f / (1.0f + expf(-x));
}

__host__ __device__ inline float logit(const float x) {
    return -logf(1.0f / (fminf(fmaxf(x, 1e-9f), 1.0f - 1e-9f)) - 1.0f);
}

inline std::atomic<size_t> &total_n_bytes_allocated() {
    static std::atomic<size_t> s_total_n_bytes_allocated{0};
    return s_total_n_bytes_allocated;
}

/// Checks the result of a cudaXXXXXX call and throws an error on failure
#define CUDA_CHECK_THROW(x)                                                \
    do {                                                                   \
        cudaError_t result = x;                                            \
        if (result != cudaSuccess)                                         \
            throw std::runtime_error(                                      \
                    std::string("CUDA Error: " #x " failed with error ") + \
                    cudaGetErrorString(result));                           \
    } while (0)

/// Checks the result of a cudaXXXXXX call and prints an error on failure
#define CUDA_CHECK_PRINT(x)                                       \
    do {                                                          \
        cudaError_t result = x;                                   \
        if (result != cudaSuccess)                                \
            std::cout << "CUDA Error: " #x " failed with error "  \
                      << cudaGetErrorString(result) << std::endl; \
    } while (0)

#define DEBUG_GUARD_SIZE 0

/// Managed memory on the Device
template <class T>
class GPUMemory {
private:
    T *m_data = nullptr;
    size_t m_size = 0;  // Number of elements
    bool m_owned = true;

public:
    GPUMemory() {}

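    GPUMemory<T> &operator=(GPUMemory<T> &&other) {
        std::swap(m_data, other.m_data);
        std::swap(m_size, other.m_size);
        // Ownership must travel with the pointer; otherwise moving from a
        // non-owning view would make this object free memory it does not own.
        std::swap(m_owned, other.m_owned);
        return *this;
    }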

    GPUMemory(GPUMemory<T> &&other) { *this = std::move(other); }

    __host__ __device__ GPUMemory(const GPUMemory<T> &other)
        : m_data{other.m_data}, m_size{other.m_size}, m_owned{false} {}

    void check_guards() const {
#if DEBUG_GUARD_SIZE > 0
        if (!m_data) return;
        uint8_t buf[DEBUG_GUARD_SIZE];
        const uint8_t *rawptr = (const uint8_t *)m_data;
        cudaMemcpy(buf, rawptr - DEBUG_GUARD_SIZE, DEBUG_GUARD_SIZE,
                   cudaMemcpyDeviceToHost);
        for (int i = 0; i < DEBUG_GUARD_SIZE; ++i)
            if (buf[i] != 0xff) {
                printf("TRASH BEFORE BLOCK offset %d data %p, read 0x%02x "
                       "expected "
                       "0xff!\n",
                       i, m_data, buf[i]);
                break;
            }
        cudaMemcpy(buf, rawptr + m_size * sizeof(T), DEBUG_GUARD_SIZE,
                   cudaMemcpyDeviceToHost);
        for (int i = 0; i < DEBUG_GUARD_SIZE; ++i)
            if (buf[i] != 0xfe) {
                printf("TRASH AFTER BLOCK offset %d data %p, read 0x%02x "
                       "expected 0xfe!\n",
                       i, m_data, buf[i]);
                break;
            }
#endif
    }

    void allocate_memory(size_t n_bytes) {
        if (n_bytes == 0) {
            return;
        }

#ifdef TCNN_VERBOSE_MEMORY_ALLOCS
        std::cout << "GPUMemory: Allocating " << bytes_to_string(n_bytes) << "."
                  << std::endl;
#endif

        uint8_t *rawptr = nullptr;
        CUDA_CHECK_THROW(cudaMalloc(&rawptr, n_bytes + DEBUG_GUARD_SIZE * 2));
#if DEBUG_GUARD_SIZE > 0
        CUDA_CHECK_THROW(cudaMemset(rawptr, 0xff, DEBUG_GUARD_SIZE));
        CUDA_CHECK_THROW(cudaMemset(rawptr + n_bytes + DEBUG_GUARD_SIZE, 0xfe,
                                    DEBUG_GUARD_SIZE));
#endif
        if (rawptr) rawptr += DEBUG_GUARD_SIZE;
        m_data = (T *)(rawptr);
        total_n_bytes_allocated() += n_bytes;
    }

    void free_memory() {
        if (!m_data) {
            return;
        }

        uint8_t *rawptr = (uint8_t *)m_data;
        if (rawptr) rawptr -= DEBUG_GUARD_SIZE;
        CUDA_CHECK_THROW(cudaFree(rawptr));

        total_n_bytes_allocated() -= get_bytes();

        m_data = nullptr;
    }

    /// Allocates memory for size items of type T
    GPUMemory(const size_t size) { resize(size); }

    /// Frees memory again
    __host__ __device__ ~GPUMemory() {
#ifndef __CUDA_ARCH__
        if (!m_owned) {
            return;
        }

        try {
            if (m_data) {
                free_memory();
                m_size = 0;
            }
        } catch (const std::runtime_error &error) {
            // Don't need to report on memory-free problems when the driver is
            // shutting down.
            if (std::string{error.what()}.find("driver shutting down") ==
                std::string::npos) {
                fprintf(stderr, "Could not free memory: %s\n", error.what());
            }
        }
#endif
    }

    /** @name Resizing/enlargement
     *  @{
     */
    /// Resizes the array to the exact new size, even if it is already larger
    void resize(const size_t size) {
        if (!m_owned) {
            throw std::runtime_error("Cannot resize non-owned memory.");
        }

        if (m_size != size) {
            if (m_size) {
                try {
                    free_memory();
                } catch (const std::runtime_error &error) {
                    throw std::runtime_error(
                            std::string("Could not free memory: ") +
                            error.what());
                }
            }

            if (size > 0) {
                try {
                    allocate_memory(size * sizeof(T));
                } catch (const std::runtime_error &error) {
                    throw std::runtime_error(
                            std::string("Could not allocate memory: ") +
                            error.what());
                }
            }

            m_size = size;
        }
    }

    /// Enlarges the array if its size is smaller
    void enlarge(const size_t size) {
        if (size > m_size) {
            resize(size);
        }
    }
    /** @} */

    /** @name Memset
     *  @{
     */
    /// Sets every byte of num_elements elements, starting at offset, to value
    void memset(const int value,
                const size_t num_elements,
                const size_t offset = 0) {
        if (num_elements + offset > m_size) {
            throw std::runtime_error(
                    "Could not set memory: Number of elements "
                    "larger than allocated memory");
        }

        try {
            CUDA_CHECK_THROW(cudaMemset(m_data + offset, value,
                                        num_elements * sizeof(T)));
        } catch (const std::runtime_error &error) {
            throw std::runtime_error(std::string("Could not set memory: ") +
                                     error.what());
        }
    }

    /// Sets the memory of all elements to value
    void memset(const int value) { memset(value, m_size); }
    /** @} */

    /** @name Copy operations
     *  @{
     */
    /// Copy data of num_elements from the raw pointer on the host
    void copy_from_host(const T *host_data, const size_t num_elements) {
        try {
            CUDA_CHECK_THROW(cudaMemcpy(data(), host_data,
                                        num_elements * sizeof(T),
                                        cudaMemcpyHostToDevice));
        } catch (const std::runtime_error &error) {
            throw std::runtime_error(std::string("Could not copy from host: ") +
                                     error.what());
        }
    }

    /// Copy num_elements from the host vector
    void copy_from_host(const std::vector<T> &data, const size_t num_elements) {
        if (data.size() < num_elements) {
            throw std::runtime_error(
                    std::string("Trying to copy ") +
                    std::to_string(num_elements) +
                    std::string(" elements, but vector size is only ") +
                    std::to_string(data.size()));
        }
        copy_from_host(data.data(), num_elements);
    }

    /// Copies data from the raw host pointer to fill the entire array
    void copy_from_host(const T *data) { copy_from_host(data, m_size); }

    /// Copies num_elements of data from the raw host pointer after enlarging
    /// the array so that everything fits in
    void enlarge_and_copy_from_host(const T *data, const size_t num_elements) {
        enlarge(num_elements);
        copy_from_host(data, num_elements);
    }

    /// Copies num_elements from the host vector after enlarging the array so
    /// that everything fits in
    void enlarge_and_copy_from_host(const std::vector<T> &data,
                                    const size_t num_elements) {
        enlarge_and_copy_from_host(data.data(), num_elements);
    }

    /// Copies the entire host vector after enlarging the array so that
    /// everything fits in
    void enlarge_and_copy_from_host(const std::vector<T> &data) {
        enlarge_and_copy_from_host(data.data(), data.size());
    }

    /// Copies num_elements of data from the raw host pointer after resizing the
    /// array
    void resize_and_copy_from_host(const T *data, const size_t num_elements) {
        resize(num_elements);
        copy_from_host(data, num_elements);
    }

    /// Copies num_elements from the host vector after resizing the array
    void resize_and_copy_from_host(const std::vector<T> &data,
                                   const size_t num_elements) {
        resize_and_copy_from_host(data.data(), num_elements);
    }

    /// Copies the entire host vector after resizing the array
    void resize_and_copy_from_host(const std::vector<T> &data) {
        resize_and_copy_from_host(data.data(), data.size());
    }

    /// Fills the entire device array from the host vector. Fails if the
    /// vector holds fewer elements than the array.
    void copy_from_host(const std::vector<T> &data) {
        if (data.size() < m_size) {
            throw std::runtime_error(
                    std::string("Trying to copy ") + std::to_string(m_size) +
                    std::string(" elements, but vector size is only ") +
                    std::to_string(data.size()));
        }
        copy_from_host(data.data(), m_size);
    }

    /// Copies num_elements from the device to a raw pointer on the host.
    /// Fails if num_elements exceeds the size of the device array.
    void copy_to_host(T *host_data, const size_t num_elements) const {
        if (num_elements > m_size) {
            throw std::runtime_error(
                    std::string("Trying to copy ") +
                    std::to_string(num_elements) +
                    std::string(" elements, but vector size is only ") +
                    std::to_string(m_size));
        }
        try {
            CUDA_CHECK_THROW(cudaMemcpy(host_data, data(),
                                        num_elements * sizeof(T),
                                        cudaMemcpyDeviceToHost));
        } catch (const std::runtime_error &error) {
            throw std::runtime_error(std::string("Could not copy to host: ") +
                                     error.what());
        }
    }

    /// Copies num_elements from the device to a vector on the host
    void copy_to_host(std::vector<T> &data, const size_t num_elements) const {
        if (data.size() < num_elements) {
            throw std::runtime_error(
                    std::string("Trying to copy ") +
                    std::to_string(num_elements) +
                    std::string(" elements, but vector size is only ") +
                    std::to_string(data.size()));
        }
        copy_to_host(data.data(), num_elements);
    }

    /// Copies all elements from the device to a raw pointer on the host
    void copy_to_host(T *data) const { copy_to_host(data, m_size); }

    /// Copies all elements from the device to a vector on the host
    void copy_to_host(std::vector<T> &data) const {
        if (data.size() < m_size) {
            throw std::runtime_error(
                    std::string("Trying to copy ") + std::to_string(m_size) +
                    std::string(" elements, but vector size is only ") +
                    std::to_string(data.size()));
        }
        copy_to_host(data.data(), m_size);
    }

    /// Copies data from another device array to this one, automatically
    /// resizing it
    void copy_from_device(const GPUMemory<T> &other) {
        if (m_size != other.m_size) {
            resize(other.m_size);
        }

        try {
            CUDA_CHECK_THROW(cudaMemcpy(m_data, other.m_data,
                                        m_size * sizeof(T),
                                        cudaMemcpyDeviceToDevice));
        } catch (const std::runtime_error &error) {
            throw std::runtime_error(
                    std::string("Could not copy from device: ") + error.what());
        }
    }

    /// Copies size elements from another device array to this one,
    /// automatically resizing it
    void copy_from_device(const GPUMemory<T> &other, const size_t size) {
        if (m_size < size) {
            resize(size);
        }

        try {
            CUDA_CHECK_THROW(cudaMemcpy(m_data, other.m_data, size * sizeof(T),
                                        cudaMemcpyDeviceToDevice));
        } catch (const std::runtime_error &error) {
            throw std::runtime_error(
                    std::string("Could not copy from device: ") + error.what());
        }
    }

    // Creates an (owned) copy of the data
    GPUMemory<T> copy() const {
        GPUMemory<T> result{m_size};
        result.copy_from_device(*this);
        return result;
    }

    T *data() const {
        check_guards();
        return m_data;
    }

    __host__ __device__ T &operator[](size_t idx) const {
#ifdef DEBUG_BUFFER_OVERRUN
        if (idx >= m_size) {
            printf("WARNING: buffer overrun of %p at idx %zu\n",
                   (void *)m_data, idx);
        }
#endif
        return m_data[idx];
    }

    __host__ __device__ T &operator[](uint32_t idx) const {
#ifdef DEBUG_BUFFER_OVERRUN
        if (idx >= m_size) {
            printf("WARNING: buffer overrun of %p at idx %u\n",
                   (void *)m_data, idx);
        }
#endif
        return m_data[idx];
    }

    size_t get_num_elements() const { return m_size; }

    size_t size() const { return get_num_elements(); }

    size_t get_bytes() const { return m_size * sizeof(T); }

    size_t bytes() const { return get_bytes(); }
};

inline std::string bytes_to_string(size_t bytes) {
    std::array<std::string, 7> suffixes = {
            {"B", "KB", "MB", "GB", "TB", "PB", "EB"}};

    double count = (double)bytes;
    uint32_t i = 0;
    // Stop at the last suffix so that suffixes[i] below is always in bounds.
    for (; i + 1 < suffixes.size() && count >= 1024; ++i) {
        count /= 1024;
    }

    std::ostringstream oss;
    oss.precision(3);
    oss << count << " " << suffixes[i];
    return oss.str();
}

template <typename T, typename fragment_t>
__host__ __device__ void warp_activation(Activation activation,
                                         const fragment_t &frag,
                                         fragment_t &result) {
    switch (activation) {
        case Activation::ReLU:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = frag.x[t] * (T)((T)frag.x[t] > (T)0.0f);
            }
            return;
        case Activation::Exponential:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = (T)(expf((float)frag.x[t]));
            }
            return;
        case Activation::Sine:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = (T)(sinf((float)frag.x[t]));
            }
            return;
        case Activation::Sigmoid:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = (T)(logistic((float)frag.x[t]));
            }
            return;
        case Activation::Squareplus:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                float x = (float)frag.x[t] * K_ACT;
                result.x[t] = (T)(0.5f * (x + sqrtf(x * x + 4)) / K_ACT);
            }
            return;
        case Activation::Softplus:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = (T)(logf(expf((float)frag.x[t] * K_ACT) + 1.0f) /
                                  K_ACT);
            }
            return;
        case Activation::None:
            result = frag;
            return;
        default:
            // Unsupported activation
            // assert(false); // Commented out due to isolated strange
            // side-effects on Windows
            return;
    }
}

template <typename T, typename fragment_t>
__host__ __device__ fragment_t warp_activation(Activation activation,
                                               const fragment_t &frag) {
    fragment_t result;
    warp_activation<T>(activation, frag, result);
    return result;
}

template <typename T, typename fragment_t, typename forward_fragment_t>
__host__ __device__ void warp_activation_backward_in(
        Activation activation,
        const fragment_t &frag,
        const forward_fragment_t &forward_frag_in,
        fragment_t &result) {
    switch (activation) {
        case Activation::ReLU:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = frag.x[t] * (T)(forward_frag_in.x[t] > (T)0.0f);
            }
            return;
        case Activation::Exponential:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = frag.x[t] * (T)(expf(forward_frag_in.x[t]));
            }
            return;
        case Activation::Sine:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = frag.x[t] * (T)(cosf(forward_frag_in.x[t]));
            }
            return;
        case Activation::Sigmoid:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                float x = logistic(forward_frag_in.x[t]);
                result.x[t] = frag.x[t] * (T)(x * (1.0f - x));
            }
            return;
        case Activation::Squareplus:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                float x = (float)forward_frag_in.x[t] * K_ACT;
                float y = 0.5f * (x + sqrtf(x * x + 4));
                result.x[t] = frag.x[t] * (T)(y * y / (y * y + 1));
            }
            return;
        case Activation::Softplus:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
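                // The derivative is evaluated at the forward pre-activation,
                // not at the incoming gradient.
                float tmp = expf((float)forward_frag_in.x[t] * K_ACT);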
                result.x[t] = frag.x[t] * (T)(tmp / (tmp + 1));
            }
            return;
        case Activation::None:
            result = frag;
            return;
        default:
            // Unsupported activation
            // assert(false); // Commented out due to isolated strange
            // side-effects on Windows
            return;
    }
}

template <typename T, typename fragment_t, typename forward_fragment_t>
__host__ __device__ fragment_t
warp_activation_backward_in(Activation activation,
                            const fragment_t &frag,
                            const forward_fragment_t &forward_frag_in) {
    fragment_t result;
    warp_activation_backward_in<T>(activation, frag, forward_frag_in, result);
    return result;
}

template <typename T, typename fragment_t, typename forward_fragment_t>
__host__ __device__ void warp_activation_backward(
        Activation activation,
        const fragment_t &frag,
        const forward_fragment_t &forward_frag,
        fragment_t &result) {
    switch (activation) {
        case Activation::ReLU:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = frag.x[t] * (T)(forward_frag.x[t] > (T)0.0f);
            }
            return;
        case Activation::Exponential:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = frag.x[t] * forward_frag.x[t];
            }
            return;
        case Activation::Sine:
            // Sine requires stored pre-activations, which we don't have. We
            // only write out the post-activations.
            // assert(false); // Commented out due to isolated strange
            // side-effects on Windows
            return;
        case Activation::Sigmoid:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] = frag.x[t] * (T)(forward_frag.x[t] *
                                              ((T)1.0f - forward_frag.x[t]));
            }
            return;
        case Activation::Squareplus:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                float y = (float)forward_frag.x[t] * K_ACT;
                result.x[t] = frag.x[t] * (T)(y * y / (y * y + 1));
            }
            return;
        case Activation::Softplus:
#pragma unroll
            for (int t = 0; t < result.num_elements; t++) {
                result.x[t] =
                        frag.x[t] *
                        (T)(1.0f - expf(-(float)forward_frag.x[t] * K_ACT));
            }
            return;
        case Activation::None:
            result = frag;
            return;
        default:
            // Unsupported activation
            // assert(false); // Commented out due to isolated strange
            // side-effects on Windows
            return;
    }
}

template <typename T, typename fragment_t, typename forward_fragment_t>
__host__ __device__ fragment_t
warp_activation_backward(Activation activation,
                         const fragment_t &frag,
                         const forward_fragment_t &forward_frag) {
    fragment_t result;
    warp_activation_backward<T>(activation, frag, forward_frag, result);
    return result;
}

================================================
FILE: lidarnerf/freqencoder/__init__.py
================================================


================================================
FILE: lidarnerf/freqencoder/backend.py
================================================
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))

nvcc_flags = [
    "-O3",
    "-std=c++14",
    "-U__CUDA_NO_HALF_OPERATORS__",
    "-U__CUDA_NO_HALF_CONVERSIONS__",
    "-U__CUDA_NO_HALF2_OPERATORS__",
    "-use_fast_math",
]

if os.name == "posix":
    c_flags = ["-O3", "-std=c++14"]
elif os.name == "nt":
    c_flags = ["/O2", "/std:c++17"]

    # find cl.exe
    def find_cl_path():
        import glob

        for edition in ["Enterprise", "Professional", "BuildTools", "Community"]:
            paths = sorted(
                glob.glob(
                    r"C:\\Program Files (x86)\\Microsoft Visual Studio\\*\\%s\\VC\\Tools\\MSVC\\*\\bin\\Hostx64\\x64"
                    % edition
                ),
                reverse=True,
            )
            if paths:
                return paths[0]

    # If cl.exe is not on path, try to find it.
    if os.system("where cl.exe >nul 2>nul") != 0:
        cl_path = find_cl_path()
        if cl_path is None:
            raise RuntimeError(
                "Could not locate a supported Microsoft Visual C++ installation"
            )
        os.environ["PATH"] += ";" + cl_path

_backend = load(
    name="_freqencoder",
    extra_cflags=c_flags,
    extra_cuda_cflags=nvcc_flags,
    sources=[
        os.path.join(_src_path, "src", f)
        for f in [
            "freqencoder.cu",
            "bindings.cpp",
        ]
    ],
)

__all__ = ["_backend"]


================================================
FILE: lidarnerf/freqencoder/freq.py
================================================
import torch
import torch.nn as nn
from torch.autograd import Function
from torch.cuda.amp import custom_bwd, custom_fwd

try:
    import _freqencoder as _backend
except ImportError:
    from .backend import _backend


class _freq_encoder(Function):
    @staticmethod
    @custom_fwd(cast_inputs=torch.float32)  # force float32 for better precision
    def forward(ctx, inputs, degree, output_dim):
        # inputs: [B, input_dim], float
        # RETURN: [B, output_dim], float

        if not inputs.is_cuda:
            inputs = inputs.cuda()
        inputs = inputs.contiguous()

        B, input_dim = inputs.shape  # batch size, coord dim

        outputs = torch.empty(B, output_dim, dtype=inputs.dtype, device=inputs.device)

        _backend.freq_encode_forward(inputs, B, input_dim, degree, output_dim, outputs)

        ctx.save_for_backward(inputs, outputs)
        ctx.dims = [B, input_dim, degree, output_dim]

        return outputs

    @staticmethod
    # @once_differentiable
    @custom_bwd
    def backward(ctx, grad):
        # grad: [B, output_dim]

        grad = grad.contiguous()
        inputs, outputs = ctx.saved_tensors
        B, input_dim, degree, output_dim = ctx.dims

        grad_inputs = torch.zeros_like(inputs)
        _backend.freq_encode_backward(
            grad, outputs, B, input_dim, degree, output_dim, grad_inputs
        )

        return grad_inputs, None, None


freq_encode = _freq_encoder.apply


class FreqEncoder(nn.Module):
    def __init__(self, input_dim=3, degree=4):
        super().__init__()

        self.input_dim = input_dim
        self.degree = degree
        self.output_dim = input_dim + input_dim * 2 * degree

    def __repr__(self):
        return f"FreqEncoder: input_dim={self.input_dim} degree={self.degree} output_dim={self.output_dim}"

    def forward(self, inputs, **kwargs):
        # inputs: [..., input_dim]
        # return: [..., output_dim]

        prefix_shape = list(inputs.shape[:-1])
        inputs = inputs.reshape(-1, self.input_dim)

        outputs = freq_encode(inputs, self.degree, self.output_dim)

        outputs = outputs.reshape(prefix_shape + [self.output_dim])

        return outputs
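
The output layout produced by `FreqEncoder` can be sketched in plain Python. This is an illustrative reference only, not part of the repository: the function name `freq_encode_reference` is hypothetical, it works on a single point rather than a batch, and it uses `math.sin`/`math.cos` instead of the CUDA fast-math intrinsics, so values may differ slightly from the extension. It assumes the channel order implied by `kernel_freq`: the raw coordinates first, then a sin block and a cos block per frequency.

```python
import math


def freq_encode_reference(x, degree):
    """Encode one point x (a list of D floats) as
    [x, sin(2^0 x), cos(2^0 x), ..., sin(2^(degree-1) x), cos(2^(degree-1) x)].

    Hypothetical helper for illustration; mirrors the channel order of kernel_freq.
    """
    out = list(x)  # identity block: the raw coordinates come first
    for f in range(degree):
        scale = 2.0 ** f
        out.extend(math.sin(scale * v) for v in x)  # sin block for frequency f
        out.extend(math.cos(scale * v) for v in x)  # cos block for frequency f
    return out


point = [0.1, -0.2, 0.3]
encoded = freq_encode_reference(point, degree=4)
# Length matches output_dim = input_dim + input_dim * 2 * degree = 3 + 3 * 2 * 4
print(len(encoded))
```

With the default `input_dim=3` and `degree=4`, the encoded vector has 27 channels, which is the `output_dim` computed in `FreqEncoder.__init__`.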


================================================
FILE: lidarnerf/freqencoder/setup.py
================================================
import os
from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CUDAExtension

_src_path = os.path.dirname(os.path.abspath(__file__))

nvcc_flags = [
    "-O3",
    "-std=c++14",
    "-U__CUDA_NO_HALF_OPERATORS__",
    "-U__CUDA_NO_HALF_CONVERSIONS__",
    "-U__CUDA_NO_HALF2_OPERATORS__",
    "-use_fast_math",
]

if os.name == "posix":
    c_flags = ["-O3", "-std=c++14"]
elif os.name == "nt":
    c_flags = ["/O2", "/std:c++17"]

    # find cl.exe
    def find_cl_path():
        import glob

        for edition in ["Enterprise", "Professional", "BuildTools", "Community"]:
            paths = sorted(
                glob.glob(
                    r"C:\\Program Files (x86)\\Microsoft Visual Studio\\*\\%s\\VC\\Tools\\MSVC\\*\\bin\\Hostx64\\x64"
                    % edition
                ),
                reverse=True,
            )
            if paths:
                return paths[0]

    # If cl.exe is not on path, try to find it.
    if os.system("where cl.exe >nul 2>nul") != 0:
        cl_path = find_cl_path()
        if cl_path is None:
            raise RuntimeError(
                "Could not locate a supported Microsoft Visual C++ installation"
            )
        os.environ["PATH"] += ";" + cl_path

setup(
    name="freqencoder",  # package name, import this to use python API
    ext_modules=[
        CUDAExtension(
            name="_freqencoder",  # extension name, import this to use CUDA API
            sources=[
                os.path.join(_src_path, "src", f)
                for f in [
                    "freqencoder.cu",
                    "bindings.cpp",
                ]
            ],
            extra_compile_args={
                "cxx": c_flags,
                "nvcc": nvcc_flags,
            },
        ),
    ],
    cmdclass={
        "build_ext": BuildExtension,
    },
)


================================================
FILE: lidarnerf/freqencoder/src/bindings.cpp
================================================
#include <torch/extension.h>

#include "freqencoder.h"

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("freq_encode_forward", &freq_encode_forward,
          "freq encode forward (CUDA)");
    m.def("freq_encode_backward", &freq_encode_backward,
          "freq encode backward (CUDA)");
}

================================================
FILE: lidarnerf/freqencoder/src/freqencoder.cu
================================================
#include <ATen/cuda/CUDAContext.h>
#include <cuda.h>
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <stdint.h>
#include <torch/torch.h>

#include <algorithm>
#include <cstdio>
#include <stdexcept>

#define CHECK_CUDA(x) \
    TORCH_CHECK(x.device().is_cuda(), #x " must be a CUDA tensor")
#define CHECK_CONTIGUOUS(x) \
    TORCH_CHECK(x.is_contiguous(), #x " must be a contiguous tensor")
#define CHECK_IS_INT(x)                                 \
    TORCH_CHECK(x.scalar_type() == at::ScalarType::Int, \
                #x " must be an int tensor")
#define CHECK_IS_FLOATING(x)                                       \
    TORCH_CHECK(x.scalar_type() == at::ScalarType::Float ||        \
                        x.scalar_type() == at::ScalarType::Half || \
                        x.scalar_type() == at::ScalarType::Double, \
                #x " must be a floating tensor")

inline constexpr __device__ float PI() { return 3.141592653589793f; }

template <typename T>
__host__ __device__ T div_round_up(T val, T divisor) {
    return (val + divisor - 1) / divisor;
}

// inputs: [B, D]
// outputs: [B, C], C = D + D * deg * 2
__global__ void kernel_freq(const float *__restrict__ inputs,
                            uint32_t B,
                            uint32_t D,
                            uint32_t deg,
                            uint32_t C,
                            float *outputs) {
    // parallel on per-element
    const uint32_t t = threadIdx.x + blockIdx.x * blockDim.x;
    if (t >= B * C) return;

    // get index
    const uint32_t b = t / C;
    const uint32_t c = t - b * C;  // t % C;

    // locate
    inputs += b * D;
    outputs += t;

    if (c < D) {
        // write self: the first D channels copy the raw input
        outputs[0] = inputs[c];
    } else {
        // write freq: even col -> sin, odd col -> cos (sin shifted by PI/2)
        const uint32_t col = c / D - 1;
        const uint32_t d = c % D;
        const uint32_t freq = col / 2;
        const float phase_shift = (col % 2) * (PI() / 2);
        outputs[0] = __sinf(scalbnf(inputs[d], freq) + phase_shift);
    }
}

// grad: [B, C], C = D + D * deg * 2
// outputs: [B, C]
// grad_inputs: [B, D]
__global__ void kernel_freq_backward(const float *__restrict__ grad,
                                     const float *__restrict__ outputs,
                                     uint32_t B,
                                     uint32_t D,
                                     uint32_t deg,
                                     uint32_t C,
                                     float *grad_inputs) {
    // parallel on per-element
    const uint32_t t = threadIdx.x + blockIdx.x * blockDim.x;
    if (t >= B * D) return;

    const uint32_t b = t / D;
    const uint32_t d = t - b * D;  // t % D;

    // locate
    grad += b * C;
    outputs += b * C;
    grad_inputs += t;

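    // accumulate in a register; the identity block passes its gradient through
    float result = grad[d];
    // skip the identity block so grad/outputs point at the first sin block
    grad += D;
    outputs += D;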

    for (uint32_t f = 0; f < deg; f++) {
        result += scalbnf(1.0f, f) *
                  (grad[d] * outputs[D + d] - grad[D + d] * outputs[d]);
        grad += 2 * D;
        outputs += 2 * D;
    }

    // write
    grad_inputs[0] = result;
}

void freq_encode_forward(at::Tensor inputs,
                         const uint32_t B,
                         const uint32_t D,
                         const uint32_t deg,
                         const uint32_t C,
                         at::Tensor outputs) {
    CHECK_CUDA(inputs);
    CHECK_CUDA(outputs);

    CHECK_CONTIGUOUS(inputs);
    CHECK_CONTIGUOUS(outputs);

    CHECK_IS_FLOATING(inputs);
    CHECK_IS_FLOATING(outputs);

    static constexpr uint32_t N_THREADS = 128;

    kernel_freq<<<div_round_up(B * C, N_THREADS), N_THREADS>>>(
            inputs.data_ptr<float>(), B, D, deg, C, outputs.data_ptr<float>());
}

void freq_encode_backward(at::Tensor grad,
                          at::Tensor outputs,
                          const uint32_t B,
                          const uint32_t D,
                          const uint32_t deg,
                          const uint32_t C,
                          at::Tensor grad_inputs) {
    CHECK_CUDA(grad);
    CHECK_CUDA(outputs);
    CHECK_CUDA(grad_inputs);

    CHECK_CONTIGUOUS(grad);
    CHECK_CONTIGUOUS(outputs);
    CHECK_CONTIGUOUS(grad_inputs);

    CHECK_IS_FLOATING(grad);
    CHECK_IS_FLOATING(outputs);
    CHECK_IS_FLOATING(grad_inputs);

    static constexpr uint32_t N_THREADS = 128;

    kernel_freq_backward<<<div_round_up(B * D, N_THREADS), N_THREADS>>>(
            grad.data_ptr<float>(), outputs.data_ptr<float>(), B, D, deg, C,
            grad_inputs.data_ptr<float>());
}
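
The accumulation in `kernel_freq_backward` can be checked against finite differences with a small pure-Python sketch. Everything here is an assumption made for illustration: the helper names `freq_encode` and `freq_encode_backward` are local to this sketch (they are not the extension's bindings), the code handles one sample instead of a batch, and it uses double-precision `math` functions rather than the CUDA intrinsics.

```python
import math


def freq_encode(x, degree):
    """Forward pass for one sample: [x, sin(2^f x), cos(2^f x) for each f]."""
    out = list(x)
    for f in range(degree):
        s = 2.0 ** f
        out += [math.sin(s * v) for v in x]
        out += [math.cos(s * v) for v in x]
    return out


def freq_encode_backward(grad, outputs, D, degree):
    """Mirror of the kernel_freq_backward loop for one sample."""
    grad_inputs = []
    for d in range(D):
        result = grad[d]  # identity block passes its gradient straight through
        for f in range(degree):
            base = D + 2 * D * f  # start of the sin block for frequency f
            g_sin, g_cos = grad[base + d], grad[base + D + d]
            o_sin, o_cos = outputs[base + d], outputs[base + D + d]
            # d/dx sin(2^f x) = 2^f cos(2^f x); d/dx cos(2^f x) = -2^f sin(2^f x)
            result += (2.0 ** f) * (g_sin * o_cos - g_cos * o_sin)
        grad_inputs.append(result)
    return grad_inputs


x = [0.3, -0.7, 1.1]
degree = 3
D = len(x)
C = D + 2 * D * degree
weights = [0.5 + 0.1 * i for i in range(C)]  # arbitrary upstream gradient

analytic = freq_encode_backward(weights, freq_encode(x, degree), D, degree)


def loss(p):
    # Scalar loss whose gradient w.r.t. the outputs is exactly `weights`
    return sum(w * o for w, o in zip(weights, freq_encode(p, degree)))


eps = 1e-6
numeric = []
for d in range(D):
    hi = list(x)
    hi[d] += eps
    lo = list(x)
    lo[d] -= eps
    numeric.append((loss(hi) - loss(lo)) / (2 * eps))  # central difference

max_err = max(abs(a - n) for a, n in zip(analytic, numeric))
print(max_err)
```

The backward pass reuses the saved forward outputs instead of recomputing sines and cosines: the cos channel supplies the derivative of the sin channel and vice versa, which is why the kernel only needs `grad` and `outputs`.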

================================================
FILE: lidarnerf/freqencoder/src/freqencoder.h
================================================
#pragma once

#include <stdint.h>
#include <torch/torch.h>

// _backend.freq_encode_forward(inputs, B, input_dim, degree, output_dim,
// outputs)
void freq_encode_forward(at::Tensor inputs,
                         const uint32_t B,
                         const uint32_t D,
                         const uint32_t deg,
                         const uint32_t C,
                         at::Tensor outputs);

// _backend.freq_encode_backward(grad, outputs, B, input_dim, degree,
// output_dim, grad_inputs)
void freq_encode_backward(at::Tensor grad,
                          at::Tensor outputs,
                          const uint32_t B,
                          const uint32_t D,
                          const uint32_t deg,
                          const uint32_t C,
                          at::Tensor grad_inputs);

================================================
FILE: lidarnerf/gridencoder/__init__.py
================================================


================================================
FILE: lidarnerf/gridencoder/backend.py
================================================
import os
from torch.utils.cpp_extension import load

_src_path = os.path.dirname(os.path.abspath(__file__))

nvcc_flags = [
    "-O3",
    "-std=c++14",
    "-U__CUDA_NO_HALF_OPERATORS__",
    "-U__CUDA_NO_HALF_CONVERSIONS__",
    "-U__CUDA_NO_HALF2_OPERATORS__",
]

if os.name == "posix":
    c_flags = ["-O3", "-std=c++14"]
elif os.name == "nt":
    c_flags = ["/O2", "/std:c++17"]

    # find cl.exe
    def find_cl_path():
        import glob

        for edition in ["Enterprise", "Professional", "BuildTools", "Community"]:
            paths = sorted(
                glob.glob(
                    r"C:\\Program Files (x86)\\Microsoft Visual Studio\\*\\%s\\VC\\Tools\\MSVC\\*\\bin\\Hostx64\\x64"
                    % edition
                ),
                reverse=True,
            )
            if paths:
                return paths[0]

    # If cl.exe is not on path, try to find it.
    if os.system("where cl.exe >nul 2>nul") != 0:
        cl_path = find_cl_path()
        if cl_path is None:
            raise RuntimeError(
                "Could not locate a supported Microsoft Visual C++ installation"
            )
        os.environ["PATH"] += ";" + cl_path

_backend = load(
    name="_grid_encoder",
    extra_cflags=c_flags,
    extra_cuda_cflags=nvcc_flags,
    sources=[
        os.path.join(_src_path, "src", f)
        for f in [
            "gridencoder.cu",
            "bindings.cpp",
        ]
    ],
)

__all__ = ["_backend"]


================================================
FILE: lidarnerf/gridencoder/grid.py
================================================
import numpy as np

import torch
import torch.nn as nn
from torch.autograd import Function
from torch.cuda.amp import custom_bwd, custom_fwd

try:
    import _gridencoder as _backend
except ImportError:
    from .backend import _backend

_gridtype_to_id = {
    "hash": 0,
    "tiled": 1,
}

_interp_to_id = {
    "linear": 0,
    "smoothstep": 1,
}


class _grid_encode(Function):
    @staticmethod
    @custom_fwd
    def forward(
        ctx,
        inputs,
        embeddings,
        offsets,
        per_level_scale,
        base_resolution,
        calc_grad_inputs=False,
        gridtype=0,
        align_c

SYMBOL INDEX (367 symbols across 57 files)

FILE: extern/chamfer3D/chamfer_cuda.cpp
  function chamfer_forward (line 25) | int chamfer_forward(at::Tensor xyz1,
  function chamfer_backward (line 34) | int chamfer_backward(at::Tensor xyz1,
  function PYBIND11_MODULE (line 46) | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {

FILE: extern/chamfer3D/dist_chamfer_3D.py
  class chamfer_3DFunction (line 41) | class chamfer_3DFunction(Function):
    method forward (line 43) | def forward(ctx, xyz1, xyz2):
    method backward (line 73) | def backward(ctx, graddist1, graddist2, gradidx1, gradidx2):
  class chamfer_3DDist (line 90) | class chamfer_3DDist(nn.Module):
    method __init__ (line 91) | def __init__(self):
    method forward (line 94) | def forward(self, input1, input2):

FILE: extern/fscore.py
  function fscore (line 4) | def fscore(dist1, dist2, threshold=0.001):

FILE: lidarnerf/activation.py
  class _trunc_exp (line 6) | class _trunc_exp(Function):
    method forward (line 9) | def forward(ctx, x):
    method backward (line 15) | def backward(ctx, g):

FILE: lidarnerf/convert.py
  function lidar_to_pano_with_intensities_with_bbox_mask (line 4) | def lidar_to_pano_with_intensities_with_bbox_mask(
  function lidar_to_pano_with_intensities (line 99) | def lidar_to_pano_with_intensities(
  function lidar_to_pano (line 163) | def lidar_to_pano(
  function pano_to_lidar_with_intensities (line 194) | def pano_to_lidar_with_intensities(pano: np.ndarray, intensities, lidar_K):
  function pano_to_lidar (line 236) | def pano_to_lidar(pano, lidar_K):
  function lidar_to_pano_with_intensities_fpa (line 253) | def lidar_to_pano_with_intensities_fpa(
  function parse_z_buffer (line 331) | def parse_z_buffer(novel_pano, lidar_H, lidar_W, threshold=0.2):

FILE: lidarnerf/dataset/base_dataset.py
  function custom_meshgrid (line 8) | def custom_meshgrid(*args):
  function get_lidar_rays (line 16) | def get_lidar_rays(poses, intrinsics, H, W, N=-1, patch_size=1):
  function get_rays (line 109) | def get_rays(poses, intrinsics, H, W, N=-1, patch_size=1):
  function nerf_matrix_to_ngp (line 186) | def nerf_matrix_to_ngp(pose, scale=0.33, offset=[0, 0, 0]):
  function visualize_poses (line 200) | def visualize_poses(poses, size=0.1):
  class BaseDataset (line 240) | class BaseDataset:

FILE: lidarnerf/dataset/kitti360_dataset.py
  class KITTI360Dataset (line 14) | class KITTI360Dataset(BaseDataset):
    method __post_init__ (line 32) | def __post_init__(self):
    method collate (line 123) | def collate(self, index):
    method dataloader (line 161) | def dataloader(self):
    method __len__ (line 174) | def __len__(self):

FILE: lidarnerf/dataset/nerfmvl_dataset.py
  class NeRFMVLDataset (line 14) | class NeRFMVLDataset(BaseDataset):
    method __post_init__ (line 32) | def __post_init__(self):
    method collate (line 116) | def collate(self, index):
    method dataloader (line 174) | def dataloader(self):
    method __len__ (line 187) | def __len__(self):

FILE: lidarnerf/encoding.py
  class FreqEncoder (line 6) | class FreqEncoder(nn.Module):
    method __init__ (line 7) | def __init__(
    method forward (line 35) | def forward(self, input, **kwargs):
  function get_encoder (line 50) | def get_encoder(
  class PeriodicVolumeEncoding (line 123) | class PeriodicVolumeEncoding(nn.Module):
    method __init__ (line 136) | def __init__(
    method parameters (line 174) | def parameters(self):
    method get_out_dim (line 177) | def get_out_dim(self) -> int:
    method hash_fn (line 180) | def hash_fn(self, in_tensor):
    method pytorch_fwd (line 201) | def pytorch_fwd(self, in_tensor):
    method forward (line 275) | def forward(self, in_tensor):
    method get_total_variation_loss (line 278) | def get_total_variation_loss(self):

FILE: lidarnerf/ffmlp/backend.py
  function find_cl_path (line 27) | def find_cl_path():

FILE: lidarnerf/ffmlp/ffmlp.py
  class _ffmlp_forward (line 14) | class _ffmlp_forward(Function):
    method forward (line 17) | def forward(
    method backward (line 96) | def backward(ctx, grad):
  function convert_activation (line 170) | def convert_activation(act):
  class FFMLP (line 187) | class FFMLP(nn.Module):
    method __init__ (line 188) | def __init__(
    method cleanup (line 235) | def cleanup(self):
    method __repr__ (line 239) | def __repr__(self):
    method reset_parameters (line 242) | def reset_parameters(self):
    method forward (line 247) | def forward(self, inputs):

FILE: lidarnerf/ffmlp/setup.py
  function find_cl_path (line 28) | def find_cl_path():

FILE: lidarnerf/ffmlp/src/bindings.cpp
  function PYBIND11_MODULE (line 5) | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {

FILE: lidarnerf/ffmlp/src/cutlass_matmul.h
  type Params (line 169) | struct Params {
  function CUTLASS_HOST_DEVICE (line 183) | CUTLASS_HOST_DEVICE
  function CUTLASS_HOST_DEVICE (line 186) | CUTLASS_HOST_DEVICE
  function CUTLASS_HOST_DEVICE (line 203) | CUTLASS_HOST_DEVICE
  type Params (line 258) | struct Params {
  function CUTLASS_HOST_DEVICE (line 274) | CUTLASS_HOST_DEVICE
  function CUTLASS_HOST_DEVICE (line 277) | CUTLASS_HOST_DEVICE
  function CUTLASS_HOST_DEVICE (line 301) | CUTLASS_HOST_DEVICE
  function cutlass_free_workspace (line 424) | inline void cutlass_free_workspace(cudaStream_t stream) {
  function typename (line 520) | typename Gemm::Arguments arguments{{M, N, K},

FILE: lidarnerf/ffmlp/src/utils.h
  function Activation (line 38) | enum class Activation {
  function enlarge (line 232) | void enlarge(const size_t size) {
  function memset (line 262) | void memset(const int value) { memset(value, m_size); }
  function copy_from_host (line 269) | void copy_from_host(const T *host_data, const size_t num_elements) {
  function copy_from_host (line 281) | void copy_from_host(const std::vector<T> &data, const size_t num_element...
  function copy_from_host (line 293) | void copy_from_host(const T *data) { copy_from_host(data, m_size); }
  function enlarge_and_copy_from_host (line 297) | void enlarge_and_copy_from_host(const T *data, const size_t num_elements) {
  function enlarge_and_copy_from_host (line 304) | void enlarge_and_copy_from_host(const std::vector<T> &data,
  function enlarge_and_copy_from_host (line 311) | void enlarge_and_copy_from_host(const std::vector<T> &data) {
  function resize_and_copy_from_host (line 317) | void resize_and_copy_from_host(const T *data, const size_t num_elements) {
  function resize_and_copy_from_host (line 323) | void resize_and_copy_from_host(const std::vector<T> &data,
  function resize_and_copy_from_host (line 329) | void resize_and_copy_from_host(const std::vector<T> &data) {
  function copy_from_host (line 335) | void copy_from_host(const std::vector<T> &data) {
  function copy_to_host (line 347) | void copy_to_host(T *host_data, const size_t num_elements) const {
  function copy_to_host (line 366) | void copy_to_host(std::vector<T> &data, const size_t num_elements) const {
  function copy_to_host (line 378) | void copy_to_host(T *data) const { copy_to_host(data, m_size); }
  function copy_to_host (line 381) | void copy_to_host(std::vector<T> &data) const {
  function copy_from_device (line 393) | void copy_from_device(const GPUMemory<T> &other) {
  function copy_from_device (line 410) | void copy_from_device(const GPUMemory<T> &other, const size_t size) {
  function T (line 431) | T *data() const {
  function std (line 463) | inline std::string bytes_to_string(size_t bytes) {
  function warp_activation (line 480) | void warp_activation(Activation activation,
  function warp_activation_backward_in (line 542) | void warp_activation_backward_in(
  function warp_activation_backward (line 610) | void warp_activation_backward(

FILE: lidarnerf/freqencoder/backend.py
  function find_cl_path (line 21) | def find_cl_path():

FILE: lidarnerf/freqencoder/freq.py
  class _freq_encoder (line 12) | class _freq_encoder(Function):
    method forward (line 15) | def forward(ctx, inputs, degree, output_dim):
    method backward (line 37) | def backward(ctx, grad):
  class FreqEncoder (line 55) | class FreqEncoder(nn.Module):
    method __init__ (line 56) | def __init__(self, input_dim=3, degree=4):
    method __repr__ (line 63) | def __repr__(self):
    method forward (line 66) | def forward(self, inputs, **kwargs):

FILE: lidarnerf/freqencoder/setup.py
  function find_cl_path (line 22) | def find_cl_path():

FILE: lidarnerf/freqencoder/src/bindings.cpp
  function PYBIND11_MODULE (line 5) | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {

FILE: lidarnerf/gridencoder/backend.py
  function find_cl_path (line 20) | def find_cl_path():

FILE: lidarnerf/gridencoder/grid.py
  class _grid_encode (line 24) | class _grid_encode(Function):
    method forward (line 27) | def forward(
    method backward (line 98) | def backward(ctx, grad):
  class GridEncoder (line 141) | class GridEncoder(nn.Module):
    method __init__ (line 142) | def __init__(
    method reset_parameters (line 202) | def reset_parameters(self):
    method __repr__ (line 206) | def __repr__(self):
    method forward (line 209) | def forward(self, inputs, bound=1):
    method grad_total_variation (line 239) | def grad_total_variation(self, weight=1e-7, inputs=None, bound=1, B=10...

FILE: lidarnerf/gridencoder/setup.py
  function find_cl_path (line 21) | def find_cl_path():

FILE: lidarnerf/gridencoder/src/bindings.cpp
  function PYBIND11_MODULE (line 5) | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {

FILE: lidarnerf/loss.py
  function mape_loss (line 6) | def mape_loss(pred, target, reduction="mean"):
  function huber_loss (line 18) | def huber_loss(pred, target, delta=0.1, reduction="mean"):
  class EffDistLoss (line 30) | class EffDistLoss(torch.autograd.Function):
    method forward (line 32) | def forward(ctx, w, m, interval):
    method backward (line 64) | def backward(ctx, grad_back):

FILE: lidarnerf/nerf/network.py
  class NeRFNetwork (line 10) | class NeRFNetwork(NeRFRenderer):
    method __init__ (line 11) | def __init__(
    method forward (line 131) | def forward(self, x, d):
    method density (line 162) | def density(self, x):
    method background (line 181) | def background(self, x, d):
    method color (line 199) | def color(self, x, d, cal_lidar_color=False, mask=None, geo_feat=None,...
    method get_params (line 240) | def get_params(self, lr):

FILE: lidarnerf/nerf/network_tcnn.py
  class NeRFNetwork (line 10) | class NeRFNetwork(NeRFRenderer):
    method __init__ (line 11) | def __init__(
    method forward (line 134) | def forward(self, x, d):
    method density (line 137) | def density(self, x):
    method color (line 154) | def color(self, x, d, cal_lidar_color=False, mask=None, geo_feat=None,...
    method get_params (line 198) | def get_params(self, lr):

FILE: lidarnerf/nerf/renderer.py
  function sample_pdf (line 10) | def sample_pdf(bins, weights, n_samples, det=False):
  function plot_pointcloud (line 49) | def plot_pointcloud(pc, color=None):
  class NeRFRenderer (line 61) | class NeRFRenderer(nn.Module):
    method __init__ (line 62) | def __init__(
    method forward (line 89) | def forward(self, x, d):
    method density (line 93) | def density(self, x):
    method color (line 96) | def color(self, x, d, mask=None, **kwargs):
    method run (line 99) | def run(
    method render (line 300) | def render(

FILE: lidarnerf/nerf/utils.py
  function is_ali_cluster (line 31) | def is_ali_cluster():
  function linear_to_srgb (line 39) | def linear_to_srgb(x):
  function srgb_to_linear (line 44) | def srgb_to_linear(x):
  function filter_bbox_dataset (line 48) | def filter_bbox_dataset(pc, OBB_local):
  function filter_poly (line 60) | def filter_poly(pcs, OBB_2D):
  function sort_quadrilateral (line 68) | def sort_quadrilateral(points):
  function is_in_poly (line 80) | def is_in_poly(px, py, poly):
  function seed_everything (line 104) | def seed_everything(seed):
  function torch_vis_2d (line 114) | def torch_vis_2d(x, renormalize=False):
  function extract_fields (line 139) | def extract_fields(bound_min, bound_max, resolution, query_func, S=128):
  function extract_geometry (line 169) | def extract_geometry(bound_min, bound_max, resolution, threshold, query_...
  class PSNRMeter (line 187) | class PSNRMeter:
    method __init__ (line 188) | def __init__(self):
    method clear (line 192) | def clear(self):
    method prepare_inputs (line 196) | def prepare_inputs(self, *inputs):
    method update (line 205) | def update(self, preds, truths):
    method measure (line 216) | def measure(self):
    method write (line 219) | def write(self, writer, global_step, prefix=""):
    method report (line 222) | def report(self):
  class RMSEMeter (line 226) | class RMSEMeter:
    method __init__ (line 227) | def __init__(self):
    method clear (line 231) | def clear(self):
    method prepare_inputs (line 235) | def prepare_inputs(self, *inputs):
    method update (line 244) | def update(self, preds, truths):
    method measure (line 255) | def measure(self):
    method write (line 258) | def write(self, writer, global_step, prefix=""):
    method report (line 261) | def report(self):
  class MAEMeter (line 265) | class MAEMeter:
    method __init__ (line 266) | def __init__(self, intensity_inv_scale=1.0):
    method clear (line 271) | def clear(self):
    method prepare_inputs (line 275) | def prepare_inputs(self, *inputs):
    method update (line 284) | def update(self, preds, truths):
    method measure (line 297) | def measure(self):
    method write (line 300) | def write(self, writer, global_step, prefix=""):
    method report (line 303) | def report(self):
  class DepthMeter (line 307) | class DepthMeter:
    method __init__ (line 308) | def __init__(self, scale):
    method clear (line 314) | def clear(self):
    method prepare_inputs (line 318) | def prepare_inputs(self, *inputs):
    method update (line 327) | def update(self, preds, truths):
    method compute_depth_errors (line 341) | def compute_depth_errors(
    method measure (line 362) | def measure(self):
    method write (line 366) | def write(self, writer, global_step, prefix=""):
    method report (line 371) | def report(self):
  class PointsMeter (line 375) | class PointsMeter:
    method __init__ (line 376) | def __init__(self, scale, intrinsics):
    method clear (line 382) | def clear(self):
    method prepare_inputs (line 386) | def prepare_inputs(self, *inputs):
    method update (line 395) | def update(self, preds, truths):
    method measure (line 418) | def measure(self):
    method write (line 423) | def write(self, writer, global_step, prefix=""):
    method report (line 426) | def report(self):
  class SSIMMeter (line 430) | class SSIMMeter:
    method __init__ (line 431) | def __init__(self, device=None):
    method clear (line 441) | def clear(self):
    method prepare_inputs (line 453) | def prepare_inputs(self, *inputs):
    method update (line 462) | def update(self, preds, truths):
    method measure (line 476) | def measure(self):
    method write (line 479) | def write(self, writer, global_step, prefix=""):
    method report (line 482) | def report(self):
  class LPIPSMeter (line 486) | class LPIPSMeter:
    method __init__ (line 487) | def __init__(self, net="alex", device=None):
    method clear (line 499) | def clear(self):
    method prepare_inputs (line 503) | def prepare_inputs(self, *inputs):
    method update (line 511) | def update(self, preds, truths):
    method measure (line 521) | def measure(self):
    method write (line 524) | def write(self, writer, global_step, prefix=""):
    method report (line 529) | def report(self):
  class Trainer (line 533) | class Trainer(object):
    method __init__ (line 534) | def __init__(
    method __del__ (line 682) | def __del__(self):
    method log (line 686) | def log(self, *args, **kwargs):
    method train_step (line 697) | def train_step(self, data):
    method eval_step (line 886) | def eval_step(self, data):
    method test_step (line 980) | def test_step(self, data, bg_color=None, perturb=False):
    method save_mesh (line 1011) | def save_mesh(self, save_path=None, resolution=256, threshold=10):
    method train (line 1044) | def train(self, train_loader, valid_loader, max_epochs):
    method evaluate (line 1079) | def evaluate(self, loader, name=None):
    method test (line 1084) | def test(self, loader, save_path=None, name=None, write_video=True):
    method train_one_epoch (line 1179) | def train_one_epoch(self, loader):
    method evaluate_one_epoch (line 1282) | def evaluate_one_epoch(self, loader, name=None):
    method save_checkpoint (line 1449) | def save_checkpoint(self, name=None, full=False, best=False, remove_ol...
    method load_checkpoint (line 1512) | def load_checkpoint(self, checkpoint=None, model_only=False):

FILE: lidarnerf/raymarching/backend.py
  function find_cl_path (line 20) | def find_cl_path():

FILE: lidarnerf/raymarching/raymarching.py
  class _near_far_from_aabb (line 15) | class _near_far_from_aabb(Function):
    method forward (line 18) | def forward(ctx, rays_o, rays_d, aabb, min_near=0.2):
  class _sph_from_ray (line 51) | class _sph_from_ray(Function):
    method forward (line 54) | def forward(ctx, rays_o, rays_d, radius):
  class _morton3D (line 85) | class _morton3D(Function):
    method forward (line 87) | def forward(ctx, coords):
  class _morton3D_invert (line 111) | class _morton3D_invert(Function):
    method forward (line 113) | def forward(ctx, indices):
  class _packbits (line 136) | class _packbits(Function):
    method forward (line 139) | def forward(ctx, grid, thresh, bitfield=None):
  class _march_rays_train (line 171) | class _march_rays_train(Function):
    method forward (line 174) | def forward(
  class _composite_rays_train (line 292) | class _composite_rays_train(Function):
    method forward (line 295) | def forward(ctx, sigmas, rgbs, deltas, rays, T_thresh=1e-4):
    method backward (line 329) | def backward(ctx, grad_weights_sum, grad_depth, grad_image):
  class _march_rays (line 367) | class _march_rays(Function):
    method forward (line 370) | def forward(
  class _composite_rays (line 463) | class _composite_rays(Function):
    method forward (line 466) | def forward(

FILE: lidarnerf/raymarching/setup.py
  function find_cl_path (line 21) | def find_cl_path():

FILE: lidarnerf/raymarching/src/bindings.cpp
  function PYBIND11_MODULE (line 5) | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {

FILE: lidarnerf/shencoder/backend.py
  function find_cl_path (line 20) | def find_cl_path():

FILE: lidarnerf/shencoder/setup.py
  function find_cl_path (line 21) | def find_cl_path():

FILE: lidarnerf/shencoder/sphere_harmonics.py
  class _sh_encoder (line 12) | class _sh_encoder(Function):
    method forward (line 15) | def forward(ctx, inputs, degree, calc_grad_inputs=False):
    method backward (line 42) | def backward(ctx, grad):
  class SHEncoder (line 62) | class SHEncoder(nn.Module):
    method __init__ (line 63) | def __init__(self, input_dim=3, degree=4):
    method __repr__ (line 75) | def __repr__(self):
    method forward (line 78) | def forward(self, inputs, size=1):

FILE: lidarnerf/shencoder/src/bindings.cpp
  function PYBIND11_MODULE (line 5) | PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {

FILE: lidarnvs/eval.py
  function eval_points_and_pano (line 9) | def eval_points_and_pano(

FILE: lidarnvs/lidarnvs_base.py
  class LidarNVSBase (line 6) | class LidarNVSBase(ABC):
    method fit (line 8) | def fit(self, dataset) -> None:
    method predict_frame (line 17) | def predict_frame(
    method predict_frame_with_raydrop (line 43) | def predict_frame_with_raydrop(

FILE: lidarnvs/lidarnvs_meshing.py
  class LidarNVSMeshing (line 20) | class LidarNVSMeshing(LidarNVSBase):
    method __init__ (line 28) | def __init__(self, ckpt_path=None):
    method fit (line 55) | def fit(self, dataset) -> None:
    method predict_frame (line 100) | def predict_frame(
    method predict_frame_with_raydrop (line 170) | def predict_frame_with_raydrop(
    method intersect_rays (line 293) | def intersect_rays(self, rays):
    method intersect_lidar (line 336) | def intersect_lidar(
  function generate_raydrop_data_meshing (line 356) | def generate_raydrop_data_meshing(dataset, nvs: LidarNVSMeshing) -> dict:

FILE: lidarnvs/lidarnvs_nksr.py
  class LidarNVSNksr (line 9) | class LidarNVSNksr(LidarNVSMeshing):
    method __init__ (line 10) | def __init__(self, ckpt_path=None):
    method _run_nksr (line 26) | def _run_nksr(

FILE: lidarnvs/lidarnvs_pcgen.py
  class LidarNVSPCGen (line 16) | class LidarNVSPCGen(LidarNVSBase):
    method __init__ (line 17) | def __init__(self, raycasting="cp", ckpt_path=None):
    method fit (line 36) | def fit(self, dataset) -> None:
    method predict_frame (line 60) | def predict_frame(
    method predict_frame_with_raydrop (line 132) | def predict_frame_with_raydrop(
  function generate_raydrop_data_pcgen (line 197) | def generate_raydrop_data_pcgen(dataset, nvs: LidarNVSPCGen, rm_pano_mas...
  function get_direction (line 236) | def get_direction(lidar_H, lidar_W, lidar_K):

FILE: lidarnvs/lidarnvs_poisson.py
  class LidarNVSPoisson (line 9) | class LidarNVSPoisson(LidarNVSMeshing):
    method _run_poisson (line 11) | def _run_poisson(
    method __init__ (line 31) | def __init__(

FILE: lidarnvs/loader.py
  function extract_dataset_frame (line 8) | def extract_dataset_frame(

FILE: lidarnvs/plot_possion_grid_search.py
  function main (line 9) | def main():

FILE: lidarnvs/raydrop_dataset_poisson.py
  class RaydropDataset (line 8) | class RaydropDataset(Dataset):
    method __init__ (line 9) | def __init__(self, data_dir, split):
    method __len__ (line 23) | def __len__(self):
    method __getitem__ (line 26) | def __getitem__(self, idx):
    method collate_fn (line 30) | def collate_fn(batch):

FILE: lidarnvs/raydrop_train_pcgen.py
  function setup_seed (line 24) | def setup_seed(seed):
  function cal_psnr (line 35) | def cal_psnr(im1, im2):
  class RayDrop (line 41) | class RayDrop(nn.Module):
    method __init__ (line 42) | def __init__(self, D=4, W=128, input_ch=3, output_ch=1):
    method forward (line 57) | def forward(self, x):
  function weights_init (line 66) | def weights_init(m):
  function config_parser (line 73) | def config_parser():
  function cosine_scheduler (line 205) | def cosine_scheduler(
  function get_embedder (line 222) | def get_embedder(multires, input_dims=3, i=0):
  class Embedder (line 241) | class Embedder:
    method __init__ (line 242) | def __init__(self, **kwargs):
    method create_embedding_fn (line 246) | def create_embedding_fn(self):
    method embed (line 271) | def embed(self, inputs):
  function run_network (line 275) | def run_network(inputs, model, embed_fn, embeddirs_fn):
  function load_pkl_data (line 286) | def load_pkl_data(data_dir, split):
  function train (line 299) | def train():

FILE: lidarnvs/raydrop_train_poisson.py
  function evaluate (line 18) | def evaluate(net, dataloader, device, amp):
  function train_model (line 75) | def train_model(
  function get_args (line 262) | def get_args():
  function main (line 307) | def main():

FILE: lidarnvs/run.py
  function main (line 18) | def main():

FILE: lidarnvs/unet.py
  class DoubleConv (line 7) | class DoubleConv(nn.Module):
    method __init__ (line 10) | def __init__(self, in_channels, out_channels, mid_channels=None):
    method forward (line 23) | def forward(self, x):
  class Down (line 27) | class Down(nn.Module):
    method __init__ (line 30) | def __init__(self, in_channels, out_channels):
    method forward (line 36) | def forward(self, x):
  class Up (line 40) | class Up(nn.Module):
    method __init__ (line 43) | def __init__(self, in_channels, out_channels, bilinear=True):
    method forward (line 56) | def forward(self, x1, x2):
  class OutConv (line 78) | class OutConv(nn.Module):
    method __init__ (line 79) | def __init__(self, in_channels, out_channels):
    method forward (line 83) | def forward(self, x):
  class UNet (line 87) | class UNet(nn.Module):
    method __init__ (line 88) | def __init__(self, n_channels, n_classes, bilinear=False):
    method forward (line 106) | def forward(self, x):
  function dice_coeff (line 120) | def dice_coeff(
  function multiclass_dice_coeff (line 143) | def multiclass_dice_coeff(
  function dice_loss (line 155) | def dice_loss(input: Tensor, target: Tensor, multiclass: bool = False):
  function main (line 161) | def main():

FILE: main_lidarnerf.py
  function get_arg_parser (line 16) | def get_arg_parser():
  function main (line 226) | def main():

FILE: preprocess/cal_centerpose_bound.py
  function cal_centerpose_bound_scale (line 10) | def cal_centerpose_bound_scale(
  function get_path_pose_from_json (line 67) | def get_path_pose_from_json(root_path, sequence_id):
  function main (line 83) | def main():

FILE: preprocess/generate_train_rangeview.py
  function all_points_to_world (line 14) | def all_points_to_world(pcd_path_list, lidar2world_list):
  function oriented_bounding_box (line 24) | def oriented_bounding_box(data):
  function get_dataset_bbox (line 47) | def get_dataset_bbox(all_class, dataset_root, out_dir):
  function LiDAR_2_Pano_NeRF_MVL (line 73) | def LiDAR_2_Pano_NeRF_MVL(
  function generate_nerf_mvl_train_data (line 95) | def generate_nerf_mvl_train_data(
  function create_nerf_mvl_rangeview (line 140) | def create_nerf_mvl_rangeview():
  function LiDAR_2_Pano_KITTI (line 180) | def LiDAR_2_Pano_KITTI(
  function generate_train_data (line 196) | def generate_train_data(
  function create_kitti_rangeview (line 225) | def create_kitti_rangeview():
  function main (line 261) | def main():

FILE: preprocess/kitti360_loader.py
  class KITTI360Loader (line 7) | class KITTI360Loader:
    method __init__ (line 8) | def __init__(self, kitti_360_root) -> None:
    method _read_variable (line 37) | def _read_variable(fid, name, M, N):
    method _load_perspective_intrinsics (line 68) | def _load_perspective_intrinsics(intrinsics_path):
    method load_images (line 99) | def load_images(self, camera_name, sequence_name, frame_ids):
    method get_image_paths (line 115) | def get_image_paths(self, camera_name, sequence_name, frame_ids):
    method _load_all_cameras (line 144) | def _load_all_cameras(self, sequence_name):
    method load_cameras (line 229) | def load_cameras(self, camera_name, sequence_name, frame_ids):
    method _load_all_lidars (line 260) | def _load_all_lidars(self, sequence_name):
    method load_lidars (line 305) | def load_lidars(self, sequence_name, frame_ids):
  function main (line 320) | def main():

FILE: preprocess/kitti360_to_nerf.py
  function normalize_Ts (line 9) | def normalize_Ts(Ts):
  function main (line 26) | def main():

FILE: preprocess/nerfmvl_loader.py
  class NeRFMVLLoader (line 5) | class NeRFMVLLoader:
    method __init__ (line 6) | def __init__(self, nerf_mvl_root, class_name) -> None:
    method _load_all_lidars (line 22) | def _load_all_lidars(
    method load_lidars (line 35) | def load_lidars(self, frame_ids):
  function main (line 49) | def main():

FILE: preprocess/nerfmvl_to_nerf.py
  function main (line 8) | def main():

FILE: setup.py
  function main (line 8) | def main():


Condensed preview: 90 files, each entry giving the file path, its character count, and a content snippet from the start of the file. The full structured content is about 620K chars.

[
  {
    "path": ".github/workflows/formatter.yml",
    "chars": 238,
    "preview": "name: Formatter\n\non:\n  push:\n    branches:\n      - main\n  pull_request:\n    types: [opened, reopened, synchronize]\n\njobs"
  },
  {
    "path": "LICENSE",
    "chars": 1077,
    "preview": "MIT License\n\nCopyright (c) 2022 Tao Tang, Yixing Lao\n\nPermission is hereby granted, free of charge, to any person obtain"
  },
  {
    "path": "configs/kitti360_1538.txt",
    "chars": 301,
    "preview": "sequence_id = 1538\nalpha_d = 1000.0\nalpha_r = 1\nalpha_i = 1e1\nalpha_grad = 100.0\ngrad_loss = True\ndesired_resolution = 3"
  },
  {
    "path": "configs/kitti360_1728.txt",
    "chars": 301,
    "preview": "sequence_id = 1728\nalpha_d = 1000.0\nalpha_r = 1\nalpha_i = 1e1\nalpha_grad = 100.0\ngrad_loss = True\ndesired_resolution = 3"
  },
  {
    "path": "configs/kitti360_1908.txt",
    "chars": 301,
    "preview": "sequence_id = 1908\nalpha_d = 1000.0\nalpha_r = 1\nalpha_i = 1e1\nalpha_grad = 100.0\ngrad_loss = True\ndesired_resolution = 3"
  },
  {
    "path": "configs/kitti360_3353.txt",
    "chars": 300,
    "preview": "sequence_id = 3353\nalpha_d = 1000.0\nalpha_r = 1\nalpha_i = 1e1\nalpha_grad = 100.0\ngrad_loss = True\ndesired_resolution = 3"
  },
  {
    "path": "configs/nerf_mvl.txt",
    "chars": 336,
    "preview": "path = data/nerf_mvl\ndataloader = nerf_mvl\nsequence_id = car\nalpha_d = 1000.0\nalpha_r = 1\nalpha_i = 1\nalpha_grad = 100.0"
  },
  {
    "path": "extern/chamfer3D/chamfer3D.cu",
    "chars": 10065,
    "preview": "\n#include <ATen/ATen.h>\n#include <cuda.h>\n#include <cuda_runtime.h>\n#include <stdio.h>\n\n#include <vector>\n\n__global__ vo"
  },
  {
    "path": "extern/chamfer3D/chamfer_cuda.cpp",
    "chars": 1649,
    "preview": "#include <torch/torch.h>\n\n#include <vector>\n\n/// TMP\n// #include \"common.h\"\n/// NOT TMP\n\nint chamfer_cuda_forward(at::Te"
  },
  {
    "path": "extern/chamfer3D/dist_chamfer_3D.py",
    "chars": 3014,
    "preview": "from torch import nn\nfrom torch.autograd import Function\nimport torch\nimport importlib\nimport os\nimport sys\nfrom pathlib"
  },
  {
    "path": "extern/chamfer3D/setup.py",
    "chars": 434,
    "preview": "from setuptools import setup\nfrom torch.utils.cpp_extension import BuildExtension, CUDAExtension\n\nsetup(\n    name=\"chamf"
  },
  {
    "path": "extern/fscore.py",
    "chars": 715,
    "preview": "import torch\n\n\ndef fscore(dist1, dist2, threshold=0.001):\n    \"\"\"\n    Calculates the F-score between two point clouds wi"
  },
  {
    "path": "lidarmvl/readme.md",
    "chars": 2420,
    "preview": "# LiDAR-MVL\n\n![dataset_vis.png](../assets/dataset_vis.png)\n\n| Sensor                        | Details (Sensor location: "
  },
  {
    "path": "lidarnerf/__init__.py",
    "chars": 22,
    "preview": "__version__ = \"0.1.0\"\n"
  },
  {
    "path": "lidarnerf/activation.py",
    "chars": 467,
    "preview": "import torch\nfrom torch.autograd import Function\nfrom torch.cuda.amp import custom_bwd, custom_fwd\n\n\nclass _trunc_exp(Fu"
  },
  {
    "path": "lidarnerf/convert.py",
    "chars": 11092,
    "preview": "import numpy as np\n\n\ndef lidar_to_pano_with_intensities_with_bbox_mask(\n    local_points_with_intensities: np.ndarray,\n "
  },
  {
    "path": "lidarnerf/dataset/base_dataset.py",
    "chars": 7812,
    "preview": "import numpy as np\nimport torch\nimport trimesh\nfrom packaging import version as pver\nfrom dataclasses import dataclass\n\n"
  },
  {
    "path": "lidarnerf/dataset/kitti360_dataset.py",
    "chars": 6382,
    "preview": "import json\nimport os\n\nimport numpy as np\nimport torch\nimport tqdm\nfrom torch.utils.data import DataLoader\nfrom dataclas"
  },
  {
    "path": "lidarnerf/dataset/nerfmvl_dataset.py",
    "chars": 7520,
    "preview": "import os\nimport json\nimport tqdm\nimport numpy as np\n\nimport torch\nfrom torch.utils.data import DataLoader\nfrom dataclas"
  },
  {
    "path": "lidarnerf/encoding.py",
    "chars": 9702,
    "preview": "import torch\nimport torch.nn as nn\nimport numpy as np\n\n\nclass FreqEncoder(nn.Module):\n    def __init__(\n        self,\n  "
  },
  {
    "path": "lidarnerf/ffmlp/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lidarnerf/ffmlp/backend.py",
    "chars": 1827,
    "preview": "import os\nfrom torch.utils.cpp_extension import load\n\n_src_path = os.path.dirname(os.path.abspath(__file__))\n\nnvcc_flags"
  },
  {
    "path": "lidarnerf/ffmlp/ffmlp.py",
    "chars": 8966,
    "preview": "import math\n\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import Function\nfrom torch.cuda.amp import custom_bw"
  },
  {
    "path": "lidarnerf/ffmlp/setup.py",
    "chars": 2222,
    "preview": "import os\nfrom setuptools import setup\nfrom torch.utils.cpp_extension import BuildExtension, CUDAExtension\n\n_src_path = "
  },
  {
    "path": "lidarnerf/ffmlp/src/bindings.cpp",
    "chars": 443,
    "preview": "#include <torch/extension.h>\n\n#include \"ffmlp.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n    m.def(\"ffmlp_forward\", "
  },
  {
    "path": "lidarnerf/ffmlp/src/cutlass_matmul.h",
    "chars": 24788,
    "preview": "/*\n * Copyright (c) 2020-2022, NVIDIA CORPORATION.  All rights reserved.\n *\n * Redistribution and use in source and bina"
  },
  {
    "path": "lidarnerf/ffmlp/src/ffmlp.cu",
    "chars": 53662,
    "preview": "#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_fp16.h>\n#include <cuda_runtime.h>\n#include <mma.h>\n#"
  },
  {
    "path": "lidarnerf/ffmlp/src/ffmlp.h",
    "chars": 1892,
    "preview": "#pragma once\n\n#include <stdint.h>\n#include <torch/torch.h>\n\n// activation: should have been enum, here we just use int.\n"
  },
  {
    "path": "lidarnerf/ffmlp/src/utils.h",
    "chars": 23506,
    "preview": "#pragma once\n\n#include <cuda.h>\n#include <cuda_fp16.h>\n#include <cuda_runtime.h>\n\n#include <array>\n#include <atomic>\n#in"
  },
  {
    "path": "lidarnerf/freqencoder/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lidarnerf/freqencoder/backend.py",
    "chars": 1483,
    "preview": "import os\nfrom torch.utils.cpp_extension import load\n\n_src_path = os.path.dirname(os.path.abspath(__file__))\n\nnvcc_flags"
  },
  {
    "path": "lidarnerf/freqencoder/freq.py",
    "chars": 2164,
    "preview": "import torch\nimport torch.nn as nn\nfrom torch.autograd import Function\nfrom torch.cuda.amp import custom_bwd, custom_fwd"
  },
  {
    "path": "lidarnerf/freqencoder/setup.py",
    "chars": 1859,
    "preview": "import os\nfrom setuptools import setup\nfrom torch.utils.cpp_extension import BuildExtension, CUDAExtension\n\n_src_path = "
  },
  {
    "path": "lidarnerf/freqencoder/src/bindings.cpp",
    "chars": 295,
    "preview": "#include <torch/extension.h>\n\n#include \"freqencoder.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n    m.def(\"freq_encod"
  },
  {
    "path": "lidarnerf/freqencoder/src/freqencoder.cu",
    "chars": 4579,
    "preview": "#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_fp16.h>\n#include <cuda_runtime.h>\n#include <stdint.h"
  },
  {
    "path": "lidarnerf/freqencoder/src/freqencoder.h",
    "chars": 826,
    "preview": "#pragma once\n\n#include <stdint.h>\n#include <torch/torch.h>\n\n// _backend.freq_encode_forward(inputs, B, input_dim, degree"
  },
  {
    "path": "lidarnerf/gridencoder/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lidarnerf/gridencoder/backend.py",
    "chars": 1462,
    "preview": "import os\nfrom torch.utils.cpp_extension import load\n\n_src_path = os.path.dirname(os.path.abspath(__file__))\n\nnvcc_flags"
  },
  {
    "path": "lidarnerf/gridencoder/grid.py",
    "chars": 8783,
    "preview": "import numpy as np\n\nimport torch\nimport torch.nn as nn\nfrom torch.autograd import Function\nfrom torch.cuda.amp import cu"
  },
  {
    "path": "lidarnerf/gridencoder/setup.py",
    "chars": 1837,
    "preview": "import os\nfrom setuptools import setup\nfrom torch.utils.cpp_extension import BuildExtension, CUDAExtension\n\n_src_path = "
  },
  {
    "path": "lidarnerf/gridencoder/src/bindings.cpp",
    "chars": 394,
    "preview": "#include <torch/extension.h>\n\n#include \"gridencoder.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n    m.def(\"grid_encod"
  },
  {
    "path": "lidarnerf/gridencoder/src/gridencoder.cu",
    "chars": 34969,
    "preview": "#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_fp16.h>\n#include <cuda_runtime.h>\n#include <stdint.h"
  },
  {
    "path": "lidarnerf/gridencoder/src/gridencoder.h",
    "chars": 2326,
    "preview": "#ifndef _HASH_ENCODE_H\n#define _HASH_ENCODE_H\n\n#include <stdint.h>\n#include <torch/torch.h>\n\n// inputs: [B, D], float, i"
  },
  {
    "path": "lidarnerf/loss.py",
    "chars": 2945,
    "preview": "import torch\n\nimport numpy as np\n\n\ndef mape_loss(pred, target, reduction=\"mean\"):\n    # pred, target: [B, 1], torch tens"
  },
  {
    "path": "lidarnerf/nerf/network.py",
    "chars": 7751,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\nfrom lidarnerf.encoding import get_encoder\nfrom lida"
  },
  {
    "path": "lidarnerf/nerf/network_tcnn.py",
    "chars": 6778,
    "preview": "import torch\n\nimport numpy as np\n\nimport tinycudann as tcnn\nfrom lidarnerf.activation import trunc_exp\nfrom .renderer im"
  },
  {
    "path": "lidarnerf/nerf/renderer.py",
    "chars": 12479,
    "preview": "import math\nimport trimesh\n\nimport torch\nimport torch.nn as nn\n\nfrom lidarnerf import raymarching\n\n\ndef sample_pdf(bins,"
  },
  {
    "path": "lidarnerf/nerf/utils.py",
    "chars": 55785,
    "preview": "import glob\nimport os\nimport random\nimport time\n\nimport cv2\nimport imageio\nimport lpips\nimport mcubes\nimport numpy as np"
  },
  {
    "path": "lidarnerf/raymarching/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lidarnerf/raymarching/backend.py",
    "chars": 1461,
    "preview": "import os\nfrom torch.utils.cpp_extension import load\n\n_src_path = os.path.dirname(os.path.abspath(__file__))\n\nnvcc_flags"
  },
  {
    "path": "lidarnerf/raymarching/raymarching.py",
    "chars": 15927,
    "preview": "import torch\nfrom torch.autograd import Function\nfrom torch.cuda.amp import custom_bwd, custom_fwd\n\ntry:\n    import _ray"
  },
  {
    "path": "lidarnerf/raymarching/setup.py",
    "chars": 2272,
    "preview": "import os\nfrom setuptools import setup\nfrom torch.utils.cpp_extension import BuildExtension, CUDAExtension\n\n_src_path = "
  },
  {
    "path": "lidarnerf/raymarching/src/bindings.cpp",
    "chars": 933,
    "preview": "#include <torch/extension.h>\n\n#include \"raymarching.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n    // utils\n    m.de"
  },
  {
    "path": "lidarnerf/raymarching/src/raymarching.cu",
    "chars": 38197,
    "preview": "#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_fp16.h>\n#include <cuda_runtime.h>\n#include <stdint.h"
  },
  {
    "path": "lidarnerf/raymarching/src/raymarching.h",
    "chars": 4154,
    "preview": "#pragma once\n\n#include <stdint.h>\n#include <torch/torch.h>\n\nvoid near_far_from_aabb(const at::Tensor rays_o,\n           "
  },
  {
    "path": "lidarnerf/shencoder/__init__.py",
    "chars": 0,
    "preview": ""
  },
  {
    "path": "lidarnerf/shencoder/backend.py",
    "chars": 1458,
    "preview": "import os\nfrom torch.utils.cpp_extension import load\n\n_src_path = os.path.dirname(os.path.abspath(__file__))\n\nnvcc_flags"
  },
  {
    "path": "lidarnerf/shencoder/setup.py",
    "chars": 1831,
    "preview": "import os\nfrom setuptools import setup\nfrom torch.utils.cpp_extension import BuildExtension, CUDAExtension\n\n_src_path = "
  },
  {
    "path": "lidarnerf/shencoder/sphere_harmonics.py",
    "chars": 2683,
    "preview": "import torch\nimport torch.nn as nn\nfrom torch.autograd import Function\nfrom torch.cuda.amp import custom_bwd, custom_fwd"
  },
  {
    "path": "lidarnerf/shencoder/src/bindings.cpp",
    "chars": 271,
    "preview": "#include <torch/extension.h>\n\n#include \"shencoder.h\"\n\nPYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {\n    m.def(\"sh_encode_fo"
  },
  {
    "path": "lidarnerf/shencoder/src/shencoder.cu",
    "chars": 47036,
    "preview": "#include <ATen/cuda/CUDAContext.h>\n#include <cuda.h>\n#include <cuda_fp16.h>\n#include <cuda_runtime.h>\n#include <stdint.h"
  },
  {
    "path": "lidarnerf/shencoder/src/shencoder.h",
    "chars": 688,
    "preview": "#pragma once\n\n#include <stdint.h>\n#include <torch/torch.h>\n\n// inputs: [B, D], float, in [-1, 1]\n// outputs: [B, F], flo"
  },
  {
    "path": "lidarnvs/__init__.py",
    "chars": 22,
    "preview": "__version__ = \"0.0.1\"\n"
  },
  {
    "path": "lidarnvs/configs/pcgen_kitti360_raydrop.txt",
    "chars": 326,
    "preview": "basedir = pcgen_raydrop_log/kitti360seq1908\ndatadir = data/raydrop/pcgen/kitti360_1908\ndataset = kitti360\nno_batching = "
  },
  {
    "path": "lidarnvs/configs/pcgen_nerfmvl_raydrop.txt",
    "chars": 411,
    "preview": "# ['water_safety_barrier', 'tire', 'pier', 'plant', 'warning_sign', 'bollard', 'pedestrian', 'car',  'traffic_cone']\nexp"
  },
  {
    "path": "lidarnvs/eval.py",
    "chars": 4740,
    "preview": "import numpy as np\nimport torch\n\nfrom skimage.metrics import structural_similarity\nfrom extern.chamfer3D.dist_chamfer_3D"
  },
  {
    "path": "lidarnvs/lidarnvs_base.py",
    "chars": 1246,
    "preview": "from abc import ABC, abstractmethod\n\nimport numpy as np\n\n\nclass LidarNVSBase(ABC):\n    @abstractmethod\n    def fit(self,"
  },
  {
    "path": "lidarnvs/lidarnvs_meshing.py",
    "chars": 16195,
    "preview": "import camtools as ct\nimport matplotlib.pyplot as plt\nimport numpy as np\nimport open3d as o3d\nimport open3d.core as o3c\n"
  },
  {
    "path": "lidarnvs/lidarnvs_nksr.py",
    "chars": 1544,
    "preview": "import open3d as o3d\nimport numpy as np\nfrom lidarnvs.lidarnvs_meshing import LidarNVSMeshing\n\nimport torch\nimport nksr\n"
  },
  {
    "path": "lidarnvs/lidarnvs_pcgen.py",
    "chars": 8594,
    "preview": "import camtools as ct\nimport numpy as np\nimport torch\nfrom tqdm import tqdm\n\nfrom lidarnerf.convert import (\n    lidar_t"
  },
  {
    "path": "lidarnvs/lidarnvs_poisson.py",
    "chars": 1758,
    "preview": "import time\n\nimport open3d as o3d\nimport numpy as np\nfrom lidarnvs.lidarnvs_meshing import LidarNVSMeshing\nimport functo"
  },
  {
    "path": "lidarnvs/loader.py",
    "chars": 2594,
    "preview": "import camtools as ct\nimport numpy as np\n\nfrom lidarnerf.dataset.base_dataset import get_lidar_rays\nfrom lidarnerf.conve"
  },
  {
    "path": "lidarnvs/plot_possion_grid_search.py",
    "chars": 1717,
    "preview": "from pathlib import Path\nimport json\n\n\nimport matplotlib.pyplot as plt\nimport numpy as np\n\n\ndef main():\n    json_path = "
  },
  {
    "path": "lidarnvs/raydrop_dataset_poisson.py",
    "chars": 2002,
    "preview": "import pickle\nfrom pathlib import Path\n\nimport torch\nfrom torch.utils.data import Dataset\n\n\nclass RaydropDataset(Dataset"
  },
  {
    "path": "lidarnvs/raydrop_train_pcgen.py",
    "chars": 17140,
    "preview": "import os\nimport numpy as np\nimport imageio\nimport random\nimport torch\nimport torch.nn as nn\nimport matplotlib.pyplot as"
  },
  {
    "path": "lidarnvs/raydrop_train_poisson.py",
    "chars": 12591,
    "preview": "import argparse\nimport logging\nfrom pathlib import Path\n\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional a"
  },
  {
    "path": "lidarnvs/readme.md",
    "chars": 2626,
    "preview": "# Lidar Novel View Synthesis Baselines\n\n![baseline_render](../assets/baseline-render.png)\n\n## LidarSim\n\n```bash\n# Genera"
  },
  {
    "path": "lidarnvs/run.py",
    "chars": 9929,
    "preview": "from pathlib import Path\n\nimport numpy as np\n\nfrom lidarnvs.lidarnvs_pcgen import LidarNVSPCGen, generate_raydrop_data_p"
  },
  {
    "path": "lidarnvs/unet.py",
    "chars": 5009,
    "preview": "import torch\nimport torch.nn as nn\nimport torch.nn.functional as F\nfrom torch import Tensor\n\n\nclass DoubleConv(nn.Module"
  },
  {
    "path": "main_lidarnerf.py",
    "chars": 15467,
    "preview": "import torch\nimport configargparse\nimport os\nimport numpy as np\n\nfrom lidarnerf.nerf.utils import (\n    seed_everything,"
  },
  {
    "path": "preprocess/cal_centerpose_bound.py",
    "chars": 3159,
    "preview": "import numpy as np\n\nnp.set_printoptions(suppress=True)\nimport os\nimport json\nimport tqdm\nfrom lidarnerf.convert import p"
  },
  {
    "path": "preprocess/generate_train_rangeview.py",
    "chars": 8328,
    "preview": "import numpy as np\nimport os\nfrom pathlib import Path\nfrom tqdm import tqdm\nimport shutil\nimport argparse\n\nfrom lidarner"
  },
  {
    "path": "preprocess/kitti360_loader.py",
    "chars": 13698,
    "preview": "from pathlib import Path\nimport numpy as np\nimport camtools as ct\nimport open3d as o3d\n\n\nclass KITTI360Loader:\n    def _"
  },
  {
    "path": "preprocess/kitti360_to_nerf.py",
    "chars": 5525,
    "preview": "from pathlib import Path\n\nfrom kitti360_loader import KITTI360Loader\nimport camtools as ct\nimport numpy as np\nimport jso"
  },
  {
    "path": "preprocess/nerfmvl_loader.py",
    "chars": 1566,
    "preview": "from pathlib import Path\nimport numpy as np\n\n\nclass NeRFMVLLoader:\n    def __init__(self, nerf_mvl_root, class_name) -> "
  },
  {
    "path": "preprocess/nerfmvl_to_nerf.py",
    "chars": 3150,
    "preview": "import os\nfrom nerfmvl_loader import NeRFMVLLoader\nimport numpy as np\nimport json\nfrom pathlib import Path\n\n\ndef main():"
  },
  {
    "path": "readme.md",
    "chars": 7720,
    "preview": "<p align=\"center\">\n   <img src=\"./assets/lidar_nerf_logo_640.png\" width=\"480\" />\n</p>\n\n<h1 align=\"center\">LiDAR-NeRF: No"
  },
  {
    "path": "requirements.txt",
    "chars": 290,
    "preview": "torch-ema\ntorchmetrics\nninja\ntrimesh\nopencv-python\ntensorboardX\nnumpy\npandas\ntqdm\nmatplotlib\nPyMCubes\nrich\npysdf\ndearpyg"
  },
  {
    "path": "requirements_torch.txt",
    "chars": 36,
    "preview": "torch==2.0.0\ntorchvision\ntorchaudio\n"
  },
  {
    "path": "setup.py",
    "chars": 933,
    "preview": "from pathlib import Path\nfrom setuptools import setup\nimport re\n\n_pwd = Path(__file__).parent.absolute()\n\n\ndef main():\n "
  }
]
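
The preview array above is plain JSON, so it can be inspected with a few lines of Python. The sketch below is illustrative only: the two entries are copied from the listing, while the helper name `chars_by_top_level` and the variable names are assumptions, not part of the repository or of the extraction tool.

```python
import json
from collections import Counter

# Two untruncated entries copied from the preview above (raw string keeps the JSON escapes).
sample = r'''[
  {"path": "lidarnerf/__init__.py", "chars": 22, "preview": "__version__ = \"0.1.0\"\n"},
  {"path": "lidarnvs/__init__.py", "chars": 22, "preview": "__version__ = \"0.0.1\"\n"}
]'''


def chars_by_top_level(entries):
    """Sum the reported character counts per top-level directory (hypothetical helper)."""
    totals = Counter()
    for entry in entries:
        # "lidarnerf/__init__.py" -> "lidarnerf"; a root-level file keeps its own name.
        top = entry["path"].split("/", 1)[0]
        totals[top] += entry["chars"]
    return totals


entries = json.loads(sample)
totals = chars_by_top_level(entries)
print(dict(totals))
```

For short files such as these, the snippet is the whole file, so `len(entry["preview"])` equals `entry["chars"]`; for longer files the preview is cut off and only `chars` reflects the real size.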

About this extraction

This page contains the full source code of the tangtaogo/lidar-nerf GitHub repository, extracted and formatted as plain text. The extraction includes 90 files (579.5 KB), approximately 153.6k tokens, and a symbol index with 367 extracted functions, classes, methods, constants, and types.
