Repository: amirgholami/ai_and_memory_wall
Branch: main
Commit: a1f1d676a6e5
Files: 3
Total size: 7.1 KB

Directory structure:
gitextract_opbd5mkb/
├── .gitignore
├── LICENSE
└── README.md

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
*~

================================================
FILE: LICENSE
================================================
MIT License

Copyright (c) 2021 Amir Gholami

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

================================================
FILE: README.md
================================================
# Memory Footprint and FLOPs for SOTA Models in CV/NLP/Speech

This repository contains the data used for the [AI and Memory Wall paper](https://arxiv.org/pdf/2403.14123.pdf). We report the number of parameters, the feature size, and the total FLOPs for inference/training for SOTA models in CV, speech, and NLP.

## NLP Models

We mostly focus on calculating the different metrics for transformer models, starting from the original BERT FLOPs for training/inference, as well as its parameters and memory footprint. We then calculate the same metrics for different BERT variations, as reported in the table below.

Note: The total PFLOPs required to train each model is calculated using the setup reported in each paper.
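
As a rough point of reference (and not the exact methodology used here, which follows each model's reported training setup), dense transformer training compute is often approximated as 6 FLOPs per parameter per training token. The sketch below applies that rule of thumb; the function name and example numbers are illustrative, not taken from the table.

```python
def training_pflops(num_params: float, tokens_seen: float) -> float:
    """Approximate total training compute in PFLOPs (1e15 FLOPs).

    Assumes a dense transformer: the forward pass costs ~2 * num_params FLOPs
    per token and the backward pass roughly twice that, i.e. ~6 FLOPs per
    parameter per training token overall.
    """
    return 6.0 * num_params * tokens_seen / 1e15


if __name__ == "__main__":
    # Illustrative numbers only: a 330M-parameter model trained on ~130B tokens.
    print(f"~{training_pflops(330e6, 130e9):,.0f} PFLOPs")
```

The table entries themselves follow the training setup reported in each paper, as noted above.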

| Date       | Model           | Token Size | #Params     | #Features | Inference GFLOPs | Training PFLOPs |
|------------|-----------------|------------|-------------|-----------|------------------|-----------------|
| 09/10/2014 | Seq2Seq         |            |             |           |                  | 11,000          |
| 12/06/2017 | Transformer     | 512        | 65M         | 77M       | 54               | 23,000          |
| 02/15/2018 | ELMo            |            | 94M         |           |                  | 3,300           |
| 10/11/2018 | BERT Large      | 512        | 330M        | 230M      | 340              | 250,000         |
| 06/11/2018 | GPT-1           | 512        | 110M        | 85M       | 96               | 57,000          |
| 02/14/2019 | GPT-2           | 1024       | 1,500M      | 2,000M    | 3,400            |                 |
| 07/26/2019 | RoBERTa Large   | 512        | 1,500M      | 2,000M    | 3,400            | 4,300,000       |
| 08/17/2019 | Megatron        | 1024       | 8,300M      | 4,700M    | 18,000           | 8,100,000       |
| 09/26/2019 | ALBERT xxl      | 512        | 235M        | 450M      | 2,500            | 31,000,000      |
| 02/13/2020 | Microsoft T-NLG | 1024       | 17,000M     | 5,700M    | 36,000           | 28,000,000      |
| 03/23/2020 | ELECTRA Large   | 128        | 330M        | 38M       | 79               | 3,100,000       |
| 05/28/2020 | GPT-3           | 2048       | 175,000M    | 63,000M   | 740,000          | 310,000,000     |
| 06/30/2020 | GShard          |            | 600,000M    |           |                  |                 |
| 06/20/2020 | Baidu RecSys-C  | N/A        | 2,000,000M  | N/A       | ~O(0.1)          | N/A             |
| 06/20/2020 | Baidu RecSys-E  | N/A        | 10,000,000M | N/A       | ~O(0.1)          | N/A             |

## CV Models

The table below reports the different metrics for various SOTA vision models, including the input image resolution, the number of parameters, the total inference GFLOPs, and the total PFLOPs required to train each model.

| Date       | Model             | Input Resolution | #Params | Inference GFLOPs | Training PFLOPs |
|------------|-------------------|------------------|---------|------------------|-----------------|
| 06/01/2012 | AlexNet           | 227 x 227        | 61M     | 1.4              | 460             |
| 09/04/2014 | VGG-19            | 224 x 224        | 138M    | 39               | 11,000          |
| 12/02/2015 | InceptionV3       | 299 x 299        | 24M     | 5.7              | 100,000         |
| 12/10/2015 | ResNet152         | 224 x 224        | 55M     | 23               | 11,000          |
| 02/26/2016 | InceptionV4       | 299 x 299        | 82M     | 24.6             |                 |
| 10/07/2016 | Xception          | 299 x 299        | 23M     | 17               | 450,000         |
| 11/16/2016 | ResNeXt101(64x4d) | 224 x 224        | 83M     | 31               | 12,000          |
| 12/03/2016 | DenseNet201       | 224 x 224        | 20M     | 8.9              | 2,800           |

## Memory Breakdown

The table below reports the breakdown of the memory required to train different SOTA models throughout the years. This includes the memory required to store the parameters, the memory footprint associated with the optimization algorithm, and the activation/feature memory.

| Year | Model                 | Input Resolution (Sentence Length) | Batch Size | Params Memory | Optimizer Memory | Activation Memory | Total Memory |
|------|-----------------------|------------------------------------|------------|---------------|------------------|-------------------|--------------|
| 2012 | AlexNet               | 227 x 227                          | 128        | 0.23 GB       | 0.23 GB          | 0.71 GB           | 1.71 GB      |
| 2014 | VGG19                 | 224 x 224                          | 64         | 0.54 GB       | 0.54 GB          | 4.64 GB           | 5.72 GB      |
| 2015 | ResNet152             | 224 x 224                          | 32         | 0.22 GB       | 0.22 GB          | 5.14 GB           | 5.58 GB      |
| 2016 | DenseNet201           | 224 x 224                          | 32         | 0.07 GB       | 0.07 GB          | 6.04 GB           | 6.18 GB      |
| 2016 | ResNeXt101 (64x4d)    | 224 x 224                          | 32         | 0.31 GB       | 0.31 GB          | 7.34 GB           | 7.96 GB      |
| 2017 | Transformer Big (WMT) | 512                                | 6          | 1.02 GB       | 2.04 GB          | 11.78 GB          | 14.84 GB     |
| 2018 | BERT Large            | 512                                | 16         | 1.32 GB       | 2.64 GB          | 14.38 GB          | 18.34 GB     |
| 2019 | GPT-2                 | 1024                               | 1          | 5.86 GB       | 11.72 GB         | 8.63 GB           | 26.21 GB     |
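
As a minimal sketch of how the first two memory columns above can be estimated, assume fp32 parameters and that momentum SGD keeps one extra copy of the parameters while Adam-style optimizers keep two (first and second moments). These assumptions and the helper function below are illustrative, not necessarily the exact accounting behind the table; activation memory additionally depends on the architecture, input size, and batch size and is not modeled here.

```python
GIB = 1024 ** 3  # bytes per GiB


def param_and_optimizer_memory(num_params, optimizer="sgd_momentum", bytes_per_param=4):
    """Return (parameter memory, optimizer-state memory) in GiB.

    Assumes fp32 parameters by default; momentum SGD stores one extra copy of
    the parameters, while Adam-style optimizers store two (first/second moments).
    """
    extra_copies = {"sgd_momentum": 1, "adam": 2}[optimizer]
    param_gib = num_params * bytes_per_param / GIB
    return param_gib, extra_copies * param_gib


if __name__ == "__main__":
    # AlexNet-scale example: ~61M fp32 parameters trained with momentum SGD.
    params_gib, opt_gib = param_and_optimizer_memory(61_000_000, "sgd_momentum")
    print(f"params ~{params_gib:.2f} GiB, optimizer state ~{opt_gib:.2f} GiB")
```

For most of the models in the table, the activation/feature memory is the dominant term in the total.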

## Acknowledgments

If you found this repository useful for your work, we would appreciate it if you cited the following paper:

```text
Gholami A, Yao Z, Kim S, Mahoney MW, Keutzer K. AI and Memory Wall. RiseLab Medium Blog Post, University of California Berkeley, 2021, March 29.
```

```text
@article{gholami2020ai_and_memory_wall,
  title={AI and Memory Wall},
  author={Gholami, Amir and Yao, Zhewei and Kim, Sehoon and Hooper, Coleman and Mahoney, Michael W. and Keutzer, Kurt},
  journal={IEEE Micro Journal},
  year={2024}
}
```