[
  {
    "path": "README.md",
    "content": "# **Papers Reading List.**\n- This is a collection of papers aiming at reducing model sizes or the ASIC/FPGA accelerator for Machine Learning, especially deep neural network related applications. (Inspiled by [Neural-Networks-on-Silicon](https://github.com/fengbintu/Neural-Networks-on-Silicon/blob/master/README.md))\n- Tutorials:\n   - **Hardware Accelerator**: Efficient Processing of Deep Neural Networks. ([link](https://arxiv.org/abs/1703.09039))\n   - **Model Compression**: Model Compression and Acceleration for Deep Neural Networks. ([link](https://arxiv.org/abs/1710.09282))\n## **Table of Contents**\n- [Our Contributions](#our-contributions)\n- [Network Compression](#network-compression)\n   - Parameter Sharing\n   - Teacher-Student Mechanism (Distilling)\n   - Fixed-precision training and storage\n   - Sparsity regularizers & Pruning\n   - Tensor Decomposition\n   - Conditional (Adaptive) Computing\n- [Hardware Accelerator](#hardware-accelerator)\n   - Benchmark and Platform Analysis\n   - Recurrent Neural Networks\n- [Conference Papers](#conference-papers)\n   - 2016: [NIPS](#nips-2016)\n   - 2017: [ICASSP](#icassp-2017)、[CVPR](#cvpr-2017)、[ICML](#icml-2017)、[ICCV](#iccv-2017)、[NIPS](#nips-2017)\n   - 2018：[ICLR](#iclr-2018)、[CVPR](#cvpr-2018)、[ECCV](#eccv-2018)、[ICML](#icml-2018)、[NIPS](#nips-2018)、[SysML](http://www.sysml.cc/2018/)\n   - 2019：[ICLR](#iclr-2019)、[CVPR](#cvpr-2019)、[SysML](https://www.sysml.cc/)\n##  **Our Contributions**\n- **TODO**\n##  **Network Compression**\n> **This field is changing rapidly, belowing entries may be somewhat antiquated.**\n### **Parameter Sharing**\n- **structured matrices**\n   - Structured Convolution Matrices for Energy-efficient Deep learning. (IBM Research–Almaden)\n   - Structured Transforms for Small-Footprint Deep Learning. 
(Google Inc)\n   - An Exploration of Parameter Redundancy in Deep Networks with Circulant Projections.\n   - Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank.\n- **Hashing**\n   - Functional Hashing for Compressing Neural Networks. (Baidu Inc)\n   - Compressing Neural Networks with the Hashing Trick. (Washington University + NVIDIA)\n- Learning compact recurrent neural networks. (University of Southern California + Google)\n\n### **Teacher-Student Mechanism (Distilling)**\n- Distilling the Knowledge in a Neural Network. (Google Inc)\n- Sequence-Level Knowledge Distillation. (Harvard University)\n- Like What You Like: Knowledge Distill via Neuron Selectivity Transfer. (TuSimple)\n\n### **Fixed-precision training and storage**\n- Binary/Ternary Neural Networks\n   - XNOR-Net, Ternary Weight Networks (TWNs), Binary-net and their variants.\n- Deep neural networks are robust to weight binarization and other non-linear distortions. (IBM Research–Almaden)\n- Recurrent Neural Networks With Limited Numerical Precision. (ETH Zurich + Montréal@Yoshua Bengio)\n- Neural Networks with Few Multiplications. (Montréal@Yoshua Bengio)\n- 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs. (Tsinghua University + Microsoft)\n- Towards the Limit of Network Quantization. (Samsung US R&D Center)\n- Incremental Network Quantization_Towards Lossless CNNs with Low-precision Weights. (Intel Labs China)\n- Loss-aware Binarization of Deep Networks. (Hong Kong University of Science and Technology)\n- Trained Ternary Quantization. (Tsinghua University + Stanford University + NVIDIA)\n\n### **Sparsity regularizers & Pruning**\n- Learning both Weights and Connections for Efficient Neural Networks. (Song Han, Stanford University)\n- Deep Compression, EIE. (Song Han, Stanford University)\n- Dynamic Network Surgery for Efficient DNNs. (Intel)\n- Compression of Neural Machine Translation Models via Pruning. 
(Stanford University)\n- Accelerating Deep Convolutional Networks using low-precision and sparsity. (Intel)\n- Faster CNNs with Direct Sparse Convolutions and Guided Pruning. (Intel)\n- Exploring Sparsity in Recurrent Neural Networks. (Baidu Research)\n- Pruning Convolutional Neural Networks for Resource Efficient Inference. (NVIDIA)\n- Pruning Filters for Efficient ConvNets. (University of Maryland + NEC Labs America)\n- Soft Weight-Sharing for Neural Network Compression. (University of Amsterdam, [reddit discussion](https://www.reddit.com/r/MachineLearning/comments/5u7h3l/r_compressing_nn_with_shannons_blessing/))\n- Sparsely-Connected Neural Networks_Towards Efficient VLSI Implementation of Deep Neural Networks. (McGill University)\n- Training Compressed Fully-Connected Networks with a Density-Diversity Penalty. (University of Washington)\n- **Bayesian Compression**\n   - Bayesian Sparsification of Recurrent Neural Networks\n   - Bayesian Compression for Deep Learning\n   - Structured Bayesian Pruning via Log-Normal Multiplicative Noise\n\n### **Tensor Decomposition**\n- Compression of Deep Convolutional Neural Networks for Fast and Low Power Mobile Applications. (Samsung, etc)\n- Learning compact recurrent neural networks. (University of Southern California + Google)\n- Tensorizing Neural Networks. (Skolkovo Institute of Science and Technology, etc)\n- Ultimate tensorization_compressing convolutional and FC layers alike. (Moscow State University, etc)\n- Efficient and Accurate Approximations of Nonlinear Convolutional Networks. (@CVPR2015)\n- Exploiting Linear Structure Within Convolutional Networks for Efficient Evaluation. (New York University, etc.)\n- Convolutional neural networks with low-rank regularization. (Princeton University, etc.)\n- Learning with Tensors: Why Now and How? (Tensor-Learn Workshop @ NIPS'16)\n\n### **Conditional (Adaptive) Computing**\n- Adaptive Computation Time for Recurrent Neural Networks. 
(Google DeepMind@Alex Graves)\n- Variable Computation in Recurrent Neural Networks. (New York University + Facebook AI Research)\n- Spatially Adaptive Computation Time for Residual Networks. ([github link](https://github.com/mfigurnov/sact), Google, etc.)\n- Hierarchical Multiscale Recurrent Neural Networks. (Montréal)\n- Outrageously Large Neural Networks_The Sparsely-Gated Mixture-of-Experts Layer. (Google Brain, etc.)\n- Adaptive Neural Networks for Fast Test-Time Prediction. (Boston University, etc)\n- Dynamic Deep Neural Networks_Optimizing Accuracy-Efficiency Trade-offs by Selective Execution. (University of Michigan)\n- **Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation**. (@Yoshua Bengio)\n- Multi-Scale Dense Convolutional Networks for Efficient Prediction. (Cornell University, etc)\n\n## **Hardware Accelerator**\n### **Benchmark and Platform Analysis**\n- Fathom: Reference Workloads for Modern Deep Learning Methods. (Harvard University)\n- DeepBench: Open-Source Tool for benchmarking DL operations. (svail.github.io-Baidu)\n- BENCHIP: Benchmarking Intelligence Processors.\n- [DAWNBench](https://dawn.cs.stanford.edu//benchmark/): An End-to-End Deep Learning Benchmark and Competition. (Stanford)\n- [MLPerf](https://mlperf.org/): A broad ML benchmark suite for measuring performance of ML software frameworks, ML hardware accelerators, and ML cloud platforms.\n\n### **Recurrent Neural Networks**\n- FPGA-based Low-power Speech Recognition with Recurrent Neural Networks. (Seoul National University)\n- Accelerating Recurrent Neural Networks in Analytics Servers: Comparison of FPGA, CPU, GPU, and ASIC. (Intel)\n- ESE: Efficient Speech Recognition Engine with Compressed LSTM on FPGA. (FPGA 2017, Best Paper Award)\n- DNPU: An 8.1TOPS/W Reconfigurable CNN-RNN Processor for General-Purpose Deep Neural Networks. 
(KAIST, ISSCC 2017)\n- Hardware Architecture of Bidirectional Long Short-Term Memory Neural Network for Optical Character Recognition. (University of Kaiserslautern, etc)\n- Efficient Hardware Mapping of Long Short-Term Memory Neural Networks for Automatic Speech Recognition. (Master Thesis@Georgios N. Evangelopoulos)\n- Hardware Accelerators for Recurrent Neural Networks on FPGA. (Purdue University, ISCAS 2017)\n- Accelerating Recurrent Neural Networks: A Memory Efficient Approach. (Nanjing University)\n- A Fast and Power Efficient Architecture to Parallelize LSTM based RNN for Cognitive Intelligence Applications.\n- An Energy-Efficient Reconfigurable Architecture for RNNs Using Dynamically Adaptive Approximate Computing.\n- A Systolically Scalable Accelerator for Near-Sensor Recurrent Neural Network Inference.\n- A High Energy Efficient Reconfigurable Hybrid Neural Network Processor for Deep Learning Applications\n- E-PUR: An Energy-Efficient Processing Unit for Recurrent Neural Networks\n- C-LSTM: Enabling Efficient LSTM using Structured Compression Techniques on FPGAs (FPGA 2018, Peking Univ, Syracuse Univ, CUNY)\n- DeltaRNN: A Power-efficient Recurrent Neural Network Accelerator. (FPGA 2018, ETHZ, BenevolentAI)\n- Towards Memory Friendly Long-Short Term Memory Networks (LSTMs) on Mobile GPUs (MICRO 2018)\n- E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs (HPCA 2019)\n\n### **Convolutional Neural Networks**\n- Please refer to [Neural-Networks-on-Silicon](https://github.com/fengbintu/Neural-Networks-on-Silicon/blob/master/README.md)\n## **Conference Papers**\n\n### **NIPS 2016**\n-  Dynamic Network Surgery for Efficient DNNs. (Intel Labs China)\n-  Memory-Efficient Backpropagation Through Time. (Google DeepMind)\n-  PerforatedCNNs: Acceleration through Elimination of Redundant Convolutions. (Moscow State University, etc.)\n-  Learning Structured Sparsity in Deep Neural Networks. 
(University of Pittsburgh)\n-  LightRNN: Memory and Computation-Efficient Recurrent Neural Networks. (Nanjing University + Microsoft Research)\n\n### **ICASSP 2017**\n- LogNet: Energy-Efficient Neural Networks Using Logarithmic Computation. (Stanford University)\n- Extended Low Rank Plus Diagonal Adaptation for Deep and Recurrent Neural Networks. (Microsoft)\n- Fixed-Point Optimization of Deep Neural Networks with Adaptive Step Size Retraining. (Seoul National University)\n- Implementation of Efficient, Low Power Deep Neural Networks on Next-Generation Intel Client Platforms (Demos). (Intel)\n- Knowledge Distillation for Small-Footprint Highway Networks. (TTI-Chicago, etc)\n- Automatic Node Selection for Deep Neural Networks Using Group Lasso Regularization. (Doshisha University, etc)\n- Accelerating Deep Convolutional Networks Using Low-Precision and Sparsity. (Intel Labs)\n\n### **CVPR 2017**\n- Designing Energy-Efficient Convolutional Neural Networks using Energy-Aware Pruning. (MIT)\n- Network Sketching: Exploiting Binary Structure in Deep CNNs. (Intel Labs China + Tsinghua University)\n- Spatially Adaptive Computation Time for Residual Networks. (Google, etc)\n- A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation. (University of Pittsburgh, etc)\n\n### **ICML 2017**\n- Deep Tensor Convolution on Multicores. (MIT)\n- Beyond Filters: Compact Feature Map for Portable Deep Model. (Peking University + University of Sydney)\n- Combined Group and Exclusive Sparsity for Deep Neural Networks. (UNIST)\n- Delta Networks for Optimized Recurrent Network Computation. (Institute of Neuroinformatics, etc)\n- MEC: Memory-efficient Convolution for Deep Neural Network. (IBM Research)\n- Deciding How to Decide: Dynamic Routing in Artificial Neural Networks. (California Institute of Technology)\n- Training Models with End-to-End Low Precision: The Cans, the Cannots, and a Little Bit of Deep Learning. 
(ETH Zurich, etc)\n- Analytical Guarantees on Numerical Precision of Deep Neural Networks. (University of Illinois at Urbana-Champaign)\n- Variational Dropout Sparsifies Deep Neural Networks. (Skoltech, etc)\n- Adaptive Neural Networks for Fast Test-Time Prediction. (Boston University, etc)\n- Theoretical Properties for Neural Networks with Weight Matrices of Low Displacement Rank. (The City University of New York, etc)\n\n### **ICCV 2017**\n- Channel Pruning for Accelerating Very Deep Neural Networks. (Xi’an Jiaotong University + Megvii Inc.)\n- ThiNet: A Filter Level Pruning Method for Deep Neural Network Compression. (Nanjing University, etc)\n- Learning Efficient Convolutional Networks through Network Slimming. (Intel Labs China, etc)\n- Performance Guaranteed Network Acceleration via High-Order Residual Quantization. (Shanghai Jiao Tong University + Peking University)\n- Coordinating Filters for Faster Deep Neural Networks. (University of Pittsburgh + Duke University, etc, [github link](https://github.com/wenwei202/caffe))\n\n### **NIPS 2017**\n- Towards Accurate Binary Convolutional Neural Network. (DJI)\n- Soft-to-Hard Vector Quantization for End-to-End Learning Compressible Representations. (ETH Zurich)\n- TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. (Duke University, etc, [github link](https://github.com/wenwei202/terngrad))\n- Flexpoint: An Adaptive Numerical Format for Efficient Training of Deep Neural Networks. (Intel)\n- Bayesian Compression for Deep Learning. (University of Amsterdam, etc)\n- Learning to Prune Deep Neural Networks via Layer-wise Optimal Brain Surgeon. (Nanyang Technological Univ)\n- Training Quantized Nets: A Deeper Understanding. (University of Maryland)\n- Structured Bayesian Pruning via Log-Normal Multiplicative Noise. (Yandex, etc)\n- Runtime Neural Pruning. (Tsinghua University)\n- The Reversible Residual Network: Backpropagation Without Storing Activations. 
(University of Toronto, [github link](https://github.com/renmengye/revnet-public))\n- Compression-aware Training of Deep Networks. (Toyota Research Institute + EPFL)\n\n### **ICLR 2018**\n- Oral\n   - Training and Inference with Integers in Deep Neural Networks. (Tsinghua University)\n- Poster\n   - Learning Sparse Neural Networks through L0 Regularization\n   - Learning Intrinsic Sparse Structures within Long Short-Term Memory\n   - Variational Network Quantization\n   - Alternating Multi-bit Quantization for Recurrent Neural Networks\n   - Mixed Precision Training\n   - Multi-Scale Dense Networks for Resource Efficient Image Classification\n   - Efficient Sparse-Winograd CNNs\n   - Compressing Word Embedding via Deep Compositional Code Learning\n   - Mixed Precision Training of Convolutional Neural Networks using Integer Operations\n   - Adaptive Quantization of Neural Networks\n   - Espresso_Efficient Forward Propagation for Binary Deep Neural Networks\n   - WRPN_Wide Reduced-Precision Networks\n   - Deep Rewiring_Training very sparse deep networks\n   - Loss-aware Weight Quantization of Deep Networks\n   - Learning to share_simultaneous parameter tying and sparsification in deep learning\n   - Deep Gradient Compression_Reducing the Communication Bandwidth for Distributed Training\n   - Large scale distributed neural network training through online distillation\n   - Learning Discrete Weights Using the Local Reparameterization Trick\n   - Rethinking the Smaller-Norm-Less-Informative Assumption in Channel Pruning of Convolution Layers\n   - Training wide residual networks for deployment using a single bit for each weight\n   - The High-Dimensional Geometry of Binary Neural Networks\n- Workshop\n   - To Prune or Not to Prune_Exploring the Efficacy of Pruning for Model Compression\n### **CVPR 2018**\n- Shift: A Zero FLOP, Zero Parameter Alternative to Spatial Convolutions\n- ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices\n- Quantization and 
Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference\n- BlockDrop: Dynamic Inference Paths in Residual Networks\n- SYQ: Learning Symmetric Quantization for Efficient Deep Neural Networks\n- Two-Step Quantization for Low-Bit Neural Networks\n- Towards Effective Low-Bitwidth Convolutional Neural Networks\n- Explicit Loss-Error-Aware Quantization for Low-Bit Deep Neural Networks\n- CLIP-Q: Deep Network Compression Learning by In-Parallel Pruning-Quantization\n- “Learning-Compression” Algorithms for Neural Net Pruning\n- Wide Compression: Tensor Ring Nets\n- NestedNet: Learning Nested Sparse Structures in Deep Neural Networks\n- Interleaved Structured Sparse Convolutional Neural Networks\n- NISP: Pruning Networks Using Neuron Importance Score Propagation\n- Learning Compact Recurrent Neural Networks With Block-Term Tensor Decomposition\n- HydraNets: Specialized Dynamic Architectures for Efficient Inference\n- Learning Time/Memory-Efficient Deep Architectures With Budgeted Super Networks\n### **ECCV 2018**\n- ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design\n- A Systematic DNN Weight Pruning Framework using Alternating Direction Method of Multipliers\n- **Learning Compression from Limited Unlabeled Data**\n- **AMC: AutoML for Model Compression and Acceleration on Mobile Devices**\n- Training Binary Weight Networks via Semi-Binary Decomposition\n- Clustering Convolutional Kernels to Compress Deep Neural Networks\n- Bi-Real Net: Enhancing the Performance of 1-bit CNNs With Improved Representational Capability and Advanced Training Algorithm\n- Data-Driven Sparse Structure Selection for Deep Neural Networks\n- Coreset-Based Neural Network Compression\n- Convolutional Networks with Adaptive Inference Graphs\n- Value-aware Quantization for Training and Inference of Neural Networks\n- LQ-Nets: Learned Quantization for Highly Accurate and Compact Deep Neural Networks\n- Deep Expander Networks: Efficient Deep Networks from Graph 
Theory\n- Extreme Network Compression via Filter Group Approximation\n- Constraint-Aware Deep Neural Network Compression\n### **ICML 2018**\n- Compressing Neural Networks using the Variational Information Bottleneck\n- DCFNet_Deep Neural Network with Decomposed Convolutional Filters\n- Deep k-Means Re-Training and Parameter Sharing with Harder Cluster Assignments for Compressing Deep Convolutions\n- Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization\n- High Performance Zero-Memory Overhead Direct Convolutions\n- Kronecker Recurrent Units\n- Learning Compact Neural Networks with Regularization\n- StrassenNets_Deep Learning with a Multiplication Budget\n- Weightless_Lossy weight encoding for deep neural network compression\n- WSNet_Compact and Efficient Networks Through Weight Sampling\n\n### **NIPS 2018**\n- Workshops\n   - [Systems for ML and Open Source Software](http://learningsys.org/nips18/schedule.html)\n   - [Compact Deep Neural Network Representation with Industrial Applications](https://openreview.net/group?id=NIPS.cc/2018/Workshop/CDNNRIA#accepted-papers)\n   - [2nd Workshop on Machine Learning on the Phone and other Consumer Devices (MLPCD 2)](https://sites.google.com/view/nips-2018-on-device-ml/call-for-papers)\n- 7761-scalable-methods-for-8-bit-training-of-neural-networks\n- 7382-frequency-domain-dynamic-pruning-for-convolutional-neural-networks\n- 7697-sparsified-sgd-with-memory\n- 7994-training-deep-neural-networks-with-8-bit-floating-point-numbers\n- 7358-kdgan-knowledge-distillation-with-generative-adversarial-networks\n- 7980-knowledge-distillation-by-on-the-fly-native-ensemble\n- 8292-multiple-instance-learning-for-efficient-sequential-data-classification-on-resource-constrained-devices\n- 7553-moonshine-distilling-with-cheap-convolutions\n- 7341-hitnet-hybrid-ternary-recurrent-neural-network\n- 8116-fastgrnn-a-fast-accurate-stable-and-tiny-kilobyte-sized-gated-recurrent-neural-network\n- 
7327-training-dnns-with-hybrid-block-floating-point\n- 8117-reversible-recurrent-neural-networks\n- 485-norm-matters-efficient-and-accurate-normalization-schemes-in-deep-networks\n- 8218-synaptic-strength-for-convolutional-neural-network\n- 7666-tetris-tile-matching-the-tremendous-irregular-sparsity\n- 7644-learning-sparse-neural-networks-via-sensitivity-driven-regularization\n- 7466-pelee-a-real-time-object-detection-system-on-mobile-devices\n- 7433-learning-versatile-filters-for-efficient-convolutional-neural-networks\n- 7841-multi-task-zipping-via-layer-wise-neuron-sharing\n- 7519-a-linear-speedup-analysis-of-distributed-deep-learning-with-sparse-and-quantized-communication\n- 7759-gradiveq-vector-quantization-for-bandwidth-efficient-gradient-aggregation-in-distributed-cnn-training\n- 8191-atomo-communication-efficient-learning-via-atomic-sparsification\n- 7405-gradient-sparsification-for-communication-efficient-distributed-optimization\n\n### **ICLR 2019**\n- Poster:\n   - SNIP: Single-shot Network Pruning based on Connection Sensitivity\n   - Rethinking the Value of Network Pruning\n   - Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach\n   - Dynamic Channel Pruning: Feature Boosting and Suppression\n   - Energy-Constrained Compression for Deep Neural Networks via Weighted Sparse Projection and Layer Input Masking\n   - Slimmable Neural Networks\n   - RotDCF: Decomposition of Convolutional Filters for Rotation-Equivariant Deep Networks\n   - Dynamic Sparse Graph for Efficient Deep Learning\n   - Big-Little Net: An Efficient Multi-Scale Feature Representation for Visual and Speech Recognition\n   - Data-Dependent Coresets for Compressing Neural Networks with Applications to Generalization Bounds\n   - Learning Recurrent Binary/Ternary Weights\n   - Double Viterbi: Weight Encoding for High Compression Ratio and Fast On-Chip Reconstruction for Deep Neural Network\n   - Relaxed Quantization for Discretized Neural 
Networks\n   - Integer Networks for Data Compression with Latent-Variable Models\n   - Minimal Random Code Learning: Getting Bits Back from Compressed Model Parameters\n   - A Systematic Study of Binary Neural Networks' Optimisation\n   - Analysis of Quantized Models\n- Oral:\n   - The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks\n### **CVPR 2019**\n- All You Need is a Few Shifts: Designing Efficient Convolutional Neural Networks for Image Classification\n- Towards Optimal Structured CNN Pruning via Generative Adversarial Learning\n- T-Net: Parametrizing Fully Convolutional Nets with a Single High-Order Tensor\n- Fully Learnable Group Convolution for Acceleration of Deep Neural Networks\n- others to be added\n"
  },
  {
    "path": "llm_quant.md",
    "content": "## Summary/Tutorials\n- https://www.ningxuefei.cc/talks/llm-efficiency-intro_tutorialonly.pdf\n- https://arxiv.org/pdf/2401.15347.pdf\n## Extreme Low-Bit Quantization\n- QuIP: 2-Bit Quantization of Large Language Models With Guarantee\n- Extreme LLM Compression of Using Additive Quantization, https://arxiv.org/pdf/2401.06118.pdf\n- Enabling Fast 2-bit LLM on GPUs: Memory Alignment and Asynchronous Dequantization\n## Binarized LLM\n- PB-LLM: PARTIALLY BINARIZED LARGE LANGUAGE MODELS, https://arxiv.org/pdf/2310.00034.pdf\n- BitNet: Scaling 1-bit Transformers for Large Language Models\n- [blog: 1-bit Quantization: Run Models with Trillions of Parameters on Your Computer](https://medium.com/@bnjmn_marie/1-bit-quantization-run-models-with-trillions-of-parameters-on-your-computer-442617a61440)\n## Mixed-Precision Quantization\n- LLM-MQ: Mixed-precision Quantization for Efficient LLM Deployment, [link](https://nicsefc.ee.tsinghua.edu.cn/%2Fnics_file%2Fpdf%2F5c805adc-b555-499f-9882-5ca35ce674b5.pdf)\n## Compressed Model Evaluation\n- COMPRESSING LLMS: THE TRUTH IS RARELY PURE AND NEVER SIMPLE https://arxiv.org/pdf/2310.01382.pdf\n- Do Emergent Abilities Exist in Quantized Large Language Models: An Empirical Study\n## Nonlinear Quantization/New Data Format\n- FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design\n- ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks\n- A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats\n## Quantization with Compensation\n- SQUEEZELLM: DENSE-AND-SPARSE QUANTIZATION\n- SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression\n## System-Level Optimization\n- Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models\n- LLM in a flash: Efficient Large Language Model Inference with Limited Memory\n## KV cache compression/Activation 
Quantization\n- KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization (UC Berkeley)\n- Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge\n\n## Others\n- Z-FOLD: A Frustratingly Easy Post-Training Quantization Scheme for LLMs (Samsung Research)\n- Norm Tweaking: High-performance Low-bit Quantization of Large Language Models\n"
  }
]