Full Code of stas00/ml-engineering for AI

Repository: stas00/ml-engineering
Branch: master
Commit: 1a78c02d0302
Files: 114
Total size: 1.0 MB

Directory structure:
gitextract_lov69rop/

├── .gitignore
├── LICENSE-CC-BY-SA
├── Makefile
├── README.md
├── build/
│   ├── README.md
│   ├── linkcheckerrc
│   ├── mdbook/
│   │   ├── md-to-html.py
│   │   ├── mv-links.py
│   │   ├── preprocess-html-for-epub.py
│   │   └── utils/
│   │       ├── build_utils.py
│   │       └── github_md_utils.py
│   ├── prince_style.css
│   └── requirements.txt
├── chapters-md.txt
├── compute/
│   ├── README.md
│   ├── accelerator/
│   │   ├── README.md
│   │   ├── amd/
│   │   │   ├── debug.md
│   │   │   └── performance.md
│   │   ├── benchmarks/
│   │   │   ├── README.md
│   │   │   └── mamf-finder.py
│   │   └── nvidia/
│   │       └── debug.md
│   ├── cpu/
│   │   └── README.md
│   └── cpu-memory/
│       └── README.md
├── contributors.md
├── debug/
│   ├── NicerTrace.py
│   ├── README.md
│   ├── make-tiny-models-tokenizers-datasets.md
│   ├── nccl-performance-debug.md
│   ├── pytorch.md
│   ├── tiny-scripts/
│   │   ├── README.md
│   │   ├── c4-en-10k.py
│   │   ├── cm4-synthetic-testing.py
│   │   ├── fsmt-make-super-tiny-model.py
│   │   ├── general-pmd-ds-unpack.py
│   │   ├── general-pmd-synthetic-testing.py
│   │   ├── idefics-make-tiny-model.py
│   │   ├── m4-ds-unpack.py
│   │   ├── mt5-make-tiny-model.py
│   │   ├── openwebtext-10k.py
│   │   └── oscar-en-10k.py
│   ├── tools.md
│   ├── torch-distributed-gpu-test.py
│   ├── torch-distributed-hanging-solutions.md
│   ├── underflow_overflow.md
│   └── underflow_overflow.py
├── inference/
│   └── README.md
├── insights/
│   ├── ai-battlefield.md
│   └── how-to-choose-cloud-provider.md
├── model-parallelism/
│   └── README.md
├── network/
│   ├── README.md
│   ├── benchmarks/
│   │   ├── README.md
│   │   ├── all_gather_object_vs_all_gather.py
│   │   ├── all_gather_object_vs_all_reduce.py
│   │   ├── all_reduce_bench.py
│   │   ├── all_reduce_bench_pyxis.sbatch
│   │   ├── all_reduce_latency_comp.py
│   │   └── results/
│   │       ├── README.md
│   │       └── disable-nvlink.md
│   ├── comms.md
│   └── debug/
│       └── README.md
├── orchestration/
│   ├── README.md
│   ├── kubernetes/
│   │   └── README.md
│   └── slurm/
│       ├── README.md
│       ├── admin.md
│       ├── cron-daily.slurm
│       ├── cron-hourly.slurm
│       ├── example.slurm
│       ├── launchers/
│       │   ├── README.md
│       │   ├── accelerate-launcher.slurm
│       │   ├── lightning-launcher.slurm
│       │   ├── srun-launcher.slurm
│       │   └── torchrun-launcher.slurm
│       ├── performance.md
│       ├── undrain-good-nodes.sh
│       └── users.md
├── resources/
│   └── README.md
├── stabs/
│   ├── README.md
│   └── incoming.md
├── storage/
│   ├── README.md
│   ├── benchmarks/
│   │   └── results/
│   │       └── hope-2023-12-20-14-37-02-331702-summary.md
│   ├── fio-json-extract.py
│   └── fio-scan
├── testing/
│   ├── README.md
│   └── testing_utils.py
├── todo.md
└── training/
    ├── README.md
    ├── checkpoints/
    │   ├── README.md
    │   ├── torch-checkpoint-convert-to-bf16
    │   └── torch-checkpoint-shrink.py
    ├── datasets.md
    ├── dtype.md
    ├── emulate-multi-node.md
    ├── fault-tolerance/
    │   ├── README.md
    │   ├── fs-watchdog.py
    │   ├── fs-watchdog.slurm
    │   ├── slurm-status.py
    │   └── slurm-status.slurm
    ├── hparams.md
    ├── instabilities/
    │   ├── README.md
    │   └── training-loss-patterns.md
    ├── model-parallelism/
    │   └── README.md
    ├── performance/
    │   ├── README.md
    │   ├── benchmarks/
    │   │   ├── activation-memory-per-layer.py
    │   │   ├── dataloader/
    │   │   │   ├── num-workers-bench.py
    │   │   │   └── pin-memory-non-block-bench.py
    │   │   ├── matrix-shape/
    │   │   │   └── swiglu-maf-bench.py
    │   │   └── numa/
    │   │       ├── numa-set-pynvml.py
    │   │       └── numa-set.sh
    │   └── distributed/
    │       └── torch-dist-mem-usage.py
    ├── re-train-hub-models.md
    ├── reproducibility/
    │   └── README.md
    └── tools/
        ├── main_process_first.py
        ├── multi-gpu-non-interleaved-print.py
        └── printflock.py

================================================
FILE CONTENTS
================================================

================================================
FILE: .gitignore
================================================
# HTML build
*.html
chapters-html.txt

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
#  Usually these files are written by a python script from a template
#  before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
.python-version

# pipenv
#   According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
#   However, in case of collaboration, if having platform-specific dependencies or dependencies
#   having no cross-platform support, pipenv may install dependencies that don't work, or not
#   install all needed dependencies.
#Pipfile.lock

# PEP 582; used by e.g. github.com/David-OConnor/pyflow
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/


================================================
FILE: LICENSE-CC-BY-SA
================================================
Attribution-ShareAlike 4.0 International

=======================================================================

Creative Commons Corporation ("Creative Commons") is not a law firm and
does not provide legal services or legal advice. Distribution of
Creative Commons public licenses does not create a lawyer-client or
other relationship. Creative Commons makes its licenses and related
information available on an "as-is" basis. Creative Commons gives no
warranties regarding its licenses, any material licensed under their
terms and conditions, or any related information. Creative Commons
disclaims all liability for damages resulting from their use to the
fullest extent possible.

Using Creative Commons Public Licenses

Creative Commons public licenses provide a standard set of terms and
conditions that creators and other rights holders may use to share
original works of authorship and other material subject to copyright
and certain other rights specified in the public license below. The
following considerations are for informational purposes only, are not
exhaustive, and do not form part of our licenses.

     Considerations for licensors: Our public licenses are
     intended for use by those authorized to give the public
     permission to use material in ways otherwise restricted by
     copyright and certain other rights. Our licenses are
     irrevocable. Licensors should read and understand the terms
     and conditions of the license they choose before applying it.
     Licensors should also secure all rights necessary before
     applying our licenses so that the public can reuse the
     material as expected. Licensors should clearly mark any
     material not subject to the license. This includes other CC-
     licensed material, or material used under an exception or
     limitation to copyright. More considerations for licensors:
	wiki.creativecommons.org/Considerations_for_licensors

     Considerations for the public: By using one of our public
     licenses, a licensor grants the public permission to use the
     licensed material under specified terms and conditions. If
     the licensor's permission is not necessary for any reason--for
     example, because of any applicable exception or limitation to
     copyright--then that use is not regulated by the license. Our
     licenses grant only permissions under copyright and certain
     other rights that a licensor has authority to grant. Use of
     the licensed material may still be restricted for other
     reasons, including because others have copyright or other
     rights in the material. A licensor may make special requests,
     such as asking that all changes be marked or described.
     Although not required by our licenses, you are encouraged to
     respect those requests where reasonable. More considerations
     for the public:
	wiki.creativecommons.org/Considerations_for_licensees

=======================================================================

Creative Commons Attribution-ShareAlike 4.0 International Public
License

By exercising the Licensed Rights (defined below), You accept and agree
to be bound by the terms and conditions of this Creative Commons
Attribution-ShareAlike 4.0 International Public License ("Public
License"). To the extent this Public License may be interpreted as a
contract, You are granted the Licensed Rights in consideration of Your
acceptance of these terms and conditions, and the Licensor grants You
such rights in consideration of benefits the Licensor receives from
making the Licensed Material available under these terms and
conditions.


Section 1 -- Definitions.

  a. Adapted Material means material subject to Copyright and Similar
     Rights that is derived from or based upon the Licensed Material
     and in which the Licensed Material is translated, altered,
     arranged, transformed, or otherwise modified in a manner requiring
     permission under the Copyright and Similar Rights held by the
     Licensor. For purposes of this Public License, where the Licensed
     Material is a musical work, performance, or sound recording,
     Adapted Material is always produced where the Licensed Material is
     synched in timed relation with a moving image.

  b. Adapter's License means the license You apply to Your Copyright
     and Similar Rights in Your contributions to Adapted Material in
     accordance with the terms and conditions of this Public License.

  c. BY-SA Compatible License means a license listed at
     creativecommons.org/compatiblelicenses, approved by Creative
     Commons as essentially the equivalent of this Public License.

  d. Copyright and Similar Rights means copyright and/or similar rights
     closely related to copyright including, without limitation,
     performance, broadcast, sound recording, and Sui Generis Database
     Rights, without regard to how the rights are labeled or
     categorized. For purposes of this Public License, the rights
     specified in Section 2(b)(1)-(2) are not Copyright and Similar
     Rights.

  e. Effective Technological Measures means those measures that, in the
     absence of proper authority, may not be circumvented under laws
     fulfilling obligations under Article 11 of the WIPO Copyright
     Treaty adopted on December 20, 1996, and/or similar international
     agreements.

  f. Exceptions and Limitations means fair use, fair dealing, and/or
     any other exception or limitation to Copyright and Similar Rights
     that applies to Your use of the Licensed Material.

  g. License Elements means the license attributes listed in the name
     of a Creative Commons Public License. The License Elements of this
     Public License are Attribution and ShareAlike.

  h. Licensed Material means the artistic or literary work, database,
     or other material to which the Licensor applied this Public
     License.

  i. Licensed Rights means the rights granted to You subject to the
     terms and conditions of this Public License, which are limited to
     all Copyright and Similar Rights that apply to Your use of the
     Licensed Material and that the Licensor has authority to license.

  j. Licensor means the individual(s) or entity(ies) granting rights
     under this Public License.

  k. Share means to provide material to the public by any means or
     process that requires permission under the Licensed Rights, such
     as reproduction, public display, public performance, distribution,
     dissemination, communication, or importation, and to make material
     available to the public including in ways that members of the
     public may access the material from a place and at a time
     individually chosen by them.

  l. Sui Generis Database Rights means rights other than copyright
     resulting from Directive 96/9/EC of the European Parliament and of
     the Council of 11 March 1996 on the legal protection of databases,
     as amended and/or succeeded, as well as other essentially
     equivalent rights anywhere in the world.

  m. You means the individual or entity exercising the Licensed Rights
     under this Public License. Your has a corresponding meaning.


Section 2 -- Scope.

  a. License grant.

       1. Subject to the terms and conditions of this Public License,
          the Licensor hereby grants You a worldwide, royalty-free,
          non-sublicensable, non-exclusive, irrevocable license to
          exercise the Licensed Rights in the Licensed Material to:

            a. reproduce and Share the Licensed Material, in whole or
               in part; and

            b. produce, reproduce, and Share Adapted Material.

       2. Exceptions and Limitations. For the avoidance of doubt, where
          Exceptions and Limitations apply to Your use, this Public
          License does not apply, and You do not need to comply with
          its terms and conditions.

       3. Term. The term of this Public License is specified in Section
          6(a).

       4. Media and formats; technical modifications allowed. The
          Licensor authorizes You to exercise the Licensed Rights in
          all media and formats whether now known or hereafter created,
          and to make technical modifications necessary to do so. The
          Licensor waives and/or agrees not to assert any right or
          authority to forbid You from making technical modifications
          necessary to exercise the Licensed Rights, including
          technical modifications necessary to circumvent Effective
          Technological Measures. For purposes of this Public License,
          simply making modifications authorized by this Section 2(a)
          (4) never produces Adapted Material.

       5. Downstream recipients.

            a. Offer from the Licensor -- Licensed Material. Every
               recipient of the Licensed Material automatically
               receives an offer from the Licensor to exercise the
               Licensed Rights under the terms and conditions of this
               Public License.

            b. Additional offer from the Licensor -- Adapted Material.
               Every recipient of Adapted Material from You
               automatically receives an offer from the Licensor to
               exercise the Licensed Rights in the Adapted Material
               under the conditions of the Adapter's License You apply.

            c. No downstream restrictions. You may not offer or impose
               any additional or different terms or conditions on, or
               apply any Effective Technological Measures to, the
               Licensed Material if doing so restricts exercise of the
               Licensed Rights by any recipient of the Licensed
               Material.

       6. No endorsement. Nothing in this Public License constitutes or
          may be construed as permission to assert or imply that You
          are, or that Your use of the Licensed Material is, connected
          with, or sponsored, endorsed, or granted official status by,
          the Licensor or others designated to receive attribution as
          provided in Section 3(a)(1)(A)(i).

  b. Other rights.

       1. Moral rights, such as the right of integrity, are not
          licensed under this Public License, nor are publicity,
          privacy, and/or other similar personality rights; however, to
          the extent possible, the Licensor waives and/or agrees not to
          assert any such rights held by the Licensor to the limited
          extent necessary to allow You to exercise the Licensed
          Rights, but not otherwise.

       2. Patent and trademark rights are not licensed under this
          Public License.

       3. To the extent possible, the Licensor waives any right to
          collect royalties from You for the exercise of the Licensed
          Rights, whether directly or through a collecting society
          under any voluntary or waivable statutory or compulsory
          licensing scheme. In all other cases the Licensor expressly
          reserves any right to collect such royalties.


Section 3 -- License Conditions.

Your exercise of the Licensed Rights is expressly made subject to the
following conditions.

  a. Attribution.

       1. If You Share the Licensed Material (including in modified
          form), You must:

            a. retain the following if it is supplied by the Licensor
               with the Licensed Material:

                 i. identification of the creator(s) of the Licensed
                    Material and any others designated to receive
                    attribution, in any reasonable manner requested by
                    the Licensor (including by pseudonym if
                    designated);

                ii. a copyright notice;

               iii. a notice that refers to this Public License;

                iv. a notice that refers to the disclaimer of
                    warranties;

                 v. a URI or hyperlink to the Licensed Material to the
                    extent reasonably practicable;

            b. indicate if You modified the Licensed Material and
               retain an indication of any previous modifications; and

            c. indicate the Licensed Material is licensed under this
               Public License, and include the text of, or the URI or
               hyperlink to, this Public License.

       2. You may satisfy the conditions in Section 3(a)(1) in any
          reasonable manner based on the medium, means, and context in
          which You Share the Licensed Material. For example, it may be
          reasonable to satisfy the conditions by providing a URI or
          hyperlink to a resource that includes the required
          information.

       3. If requested by the Licensor, You must remove any of the
          information required by Section 3(a)(1)(A) to the extent
          reasonably practicable.

  b. ShareAlike.

     In addition to the conditions in Section 3(a), if You Share
     Adapted Material You produce, the following conditions also apply.

       1. The Adapter's License You apply must be a Creative Commons
          license with the same License Elements, this version or
          later, or a BY-SA Compatible License.

       2. You must include the text of, or the URI or hyperlink to, the
          Adapter's License You apply. You may satisfy this condition
          in any reasonable manner based on the medium, means, and
          context in which You Share Adapted Material.

       3. You may not offer or impose any additional or different terms
          or conditions on, or apply any Effective Technological
          Measures to, Adapted Material that restrict exercise of the
          rights granted under the Adapter's License You apply.


Section 4 -- Sui Generis Database Rights.

Where the Licensed Rights include Sui Generis Database Rights that
apply to Your use of the Licensed Material:

  a. for the avoidance of doubt, Section 2(a)(1) grants You the right
     to extract, reuse, reproduce, and Share all or a substantial
     portion of the contents of the database;

  b. if You include all or a substantial portion of the database
     contents in a database in which You have Sui Generis Database
     Rights, then the database in which You have Sui Generis Database
     Rights (but not its individual contents) is Adapted Material,
     including for purposes of Section 3(b); and

  c. You must comply with the conditions in Section 3(a) if You Share
     all or a substantial portion of the contents of the database.

For the avoidance of doubt, this Section 4 supplements and does not
replace Your obligations under this Public License where the Licensed
Rights include other Copyright and Similar Rights.


Section 5 -- Disclaimer of Warranties and Limitation of Liability.

  a. UNLESS OTHERWISE SEPARATELY UNDERTAKEN BY THE LICENSOR, TO THE
     EXTENT POSSIBLE, THE LICENSOR OFFERS THE LICENSED MATERIAL AS-IS
     AND AS-AVAILABLE, AND MAKES NO REPRESENTATIONS OR WARRANTIES OF
     ANY KIND CONCERNING THE LICENSED MATERIAL, WHETHER EXPRESS,
     IMPLIED, STATUTORY, OR OTHER. THIS INCLUDES, WITHOUT LIMITATION,
     WARRANTIES OF TITLE, MERCHANTABILITY, FITNESS FOR A PARTICULAR
     PURPOSE, NON-INFRINGEMENT, ABSENCE OF LATENT OR OTHER DEFECTS,
     ACCURACY, OR THE PRESENCE OR ABSENCE OF ERRORS, WHETHER OR NOT
     KNOWN OR DISCOVERABLE. WHERE DISCLAIMERS OF WARRANTIES ARE NOT
     ALLOWED IN FULL OR IN PART, THIS DISCLAIMER MAY NOT APPLY TO YOU.

  b. TO THE EXTENT POSSIBLE, IN NO EVENT WILL THE LICENSOR BE LIABLE
     TO YOU ON ANY LEGAL THEORY (INCLUDING, WITHOUT LIMITATION,
     NEGLIGENCE) OR OTHERWISE FOR ANY DIRECT, SPECIAL, INDIRECT,
     INCIDENTAL, CONSEQUENTIAL, PUNITIVE, EXEMPLARY, OR OTHER LOSSES,
     COSTS, EXPENSES, OR DAMAGES ARISING OUT OF THIS PUBLIC LICENSE OR
     USE OF THE LICENSED MATERIAL, EVEN IF THE LICENSOR HAS BEEN
     ADVISED OF THE POSSIBILITY OF SUCH LOSSES, COSTS, EXPENSES, OR
     DAMAGES. WHERE A LIMITATION OF LIABILITY IS NOT ALLOWED IN FULL OR
     IN PART, THIS LIMITATION MAY NOT APPLY TO YOU.

  c. The disclaimer of warranties and limitation of liability provided
     above shall be interpreted in a manner that, to the extent
     possible, most closely approximates an absolute disclaimer and
     waiver of all liability.


Section 6 -- Term and Termination.

  a. This Public License applies for the term of the Copyright and
     Similar Rights licensed here. However, if You fail to comply with
     this Public License, then Your rights under this Public License
     terminate automatically.

  b. Where Your right to use the Licensed Material has terminated under
     Section 6(a), it reinstates:

       1. automatically as of the date the violation is cured, provided
          it is cured within 30 days of Your discovery of the
          violation; or

       2. upon express reinstatement by the Licensor.

     For the avoidance of doubt, this Section 6(b) does not affect any
     right the Licensor may have to seek remedies for Your violations
     of this Public License.

  c. For the avoidance of doubt, the Licensor may also offer the
     Licensed Material under separate terms or conditions or stop
     distributing the Licensed Material at any time; however, doing so
     will not terminate this Public License.

  d. Sections 1, 5, 6, 7, and 8 survive termination of this Public
     License.


Section 7 -- Other Terms and Conditions.

  a. The Licensor shall not be bound by any additional or different
     terms or conditions communicated by You unless expressly agreed.

  b. Any arrangements, understandings, or agreements regarding the
     Licensed Material not stated herein are separate from and
     independent of the terms and conditions of this Public License.


Section 8 -- Interpretation.

  a. For the avoidance of doubt, this Public License does not, and
     shall not be interpreted to, reduce, limit, restrict, or impose
     conditions on any use of the Licensed Material that could lawfully
     be made without permission under this Public License.

  b. To the extent possible, if any provision of this Public License is
     deemed unenforceable, it shall be automatically reformed to the
     minimum extent necessary to make it enforceable. If the provision
     cannot be reformed, it shall be severed from this Public License
     without affecting the enforceability of the remaining terms and
     conditions.

  c. No term or condition of this Public License will be waived and no
     failure to comply consented to unless expressly agreed to by the
     Licensor.

  d. Nothing in this Public License constitutes or may be interpreted
     as a limitation upon, or waiver of, any privileges and immunities
     that apply to the Licensor or You, including from the legal
     processes of any jurisdiction or authority.


=======================================================================

Creative Commons is not a party to its public
licenses. Notwithstanding, Creative Commons may elect to apply one of
its public licenses to material it publishes and in those instances
will be considered the “Licensor.” The text of the Creative Commons
public licenses is dedicated to the public domain under the CC0 Public
Domain Dedication. Except for the limited purpose of indicating that
material is shared under a Creative Commons public license or as
otherwise permitted by the Creative Commons policies published at
creativecommons.org/policies, Creative Commons does not authorize the
use of the trademark "Creative Commons" or any other trademark or logo
of Creative Commons without its prior written consent including,
without limitation, in connection with any unauthorized modifications
to any of its public licenses or any other arrangements,
understandings, or agreements concerning use of licensed material. For
the avoidance of doubt, this paragraph does not form part of the
public licenses.

Creative Commons may be contacted at creativecommons.org.


================================================
FILE: Makefile
================================================
# usage: make help

.PHONY: help spell prep-html-files html html-local pdf epub upload check-links-local check-links-all clean
.DEFAULT_GOAL := help

help: ## this help
	@awk 'BEGIN {FS = ":.*##"; printf "\nUsage:\n  make \033[36m<target>\033[0m\n"} /^[a-zA-Z_-]+:.*?##/ { printf "  \033[36m%-22s\033[0m %s\n", $$1, $$2 } /^##@/ { printf "\n\033[1m%s\033[0m\n", substr($$0, 5) } ' $(MAKEFILE_LIST)

# pip install codespell
spell: ## spellcheck
	@codespell --write-changes --skip "*.pdf" --skip "*.json"

prep-html-files: ## prepare html-files
	echo book-front.html > chapters-html.txt
	perl -ne 's|\.md|.html|; print' chapters-md.txt >> chapters-html.txt

html: prep-html-files ## make html version w/ scripts linking to their url at my github repo
	python build/mdbook/md-to-html.py

html-local: prep-html-files ## make html version w/ scripts remaining local
	python build/mdbook/md-to-html.py --local

pdf: html ## make pdf version (from html files)
	prince --no-author-style -s build/prince_style.css --pdf-title="Stas Bekman - Machine Learning Engineering ($$(date))" -o "Stas Bekman - Machine Learning Engineering.pdf" $$(cat chapters-html.txt | tr "\n" " ")

epub: html ## make epub version (from html files)
	python build/mdbook/preprocess-html-for-epub.py && \
	pandoc --from html --to epub3 \
		--output "Stas Bekman - Machine Learning Engineering.epub" \
		--metadata title="Machine Learning Engineering" \
		--metadata author="Stas Bekman" \
		--metadata date="$$(date +%Y-%m-%d)" \
		--metadata language="en" \
		--epub-cover-image=images/Machine-Learning-Engineering-book-cover.png \
		--resource-path=.:$$(cat chapters-html.txt | xargs -n1 dirname | awk '!seen[$$0]++' | tr "\n" ":") \
		$$(cat chapters-html.txt | tr "\n" " ")

upload: pdf epub ## upload pdf and epub to the hub
	cp "Stas Bekman - Machine Learning Engineering.pdf" ml-engineering-book/
	cp "Stas Bekman - Machine Learning Engineering.epub" ml-engineering-book/
	cd ml-engineering-book/ && git commit -m "new version" "Stas Bekman - Machine Learning Engineering.pdf" "Stas Bekman - Machine Learning Engineering.epub" && git push

check-links-local: html-local ## check local links
	linkchecker --config build/linkcheckerrc $$(cat chapters-html.txt | tr "\n" " ") | tee linkchecker-local.txt

check-links-all: html ## check all links including external ones
	linkchecker --config build/linkcheckerrc $$(cat chapters-html.txt | tr "\n" " ") --check-extern --user-agent="Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0" | tee linkchecker-all.txt

clean: ## remove build files
	find . -name "*html" -exec rm {} \;


================================================
FILE: README.md
================================================
# Machine Learning Engineering Open Book

This is an open collection of methodologies, tools and step-by-step instructions to help with the successful training, fine-tuning and inference of large language models and multi-modal models.

This is technical material suitable for LLM/VLM training engineers and operators. That is, the content here contains lots of scripts and copy-n-paste commands to enable you to quickly address your needs.

This repo is an ongoing brain dump of my experiences training Large Language Models (LLMs) and VLMs; I acquired a lot of the know-how while training the open-source [BLOOM-176B](https://huggingface.co/bigscience/bloom) model in 2022, the [IDEFICS-80B](https://huggingface.co/HuggingFaceM4/idefics-80b-instruct) multi-modal model in 2023, and RAG models at [Contextual.AI](https://contextual.ai/) in 2024.

I've been compiling this information mostly for myself so that I could quickly find solutions I have already researched in the past and which have worked, but as usual I'm happy to share these notes with the wider ML community.


## Table of Contents


**Part 1. Insights**

1. **[The AI Battlefield Engineering](./insights/ai-battlefield.md)** - what you need to know in order to succeed.

1. **[How to Choose a Cloud Provider](./insights/how-to-choose-cloud-provider.md)** - these questions will empower you to have a successful compute cloud experience.

**Part 2. Hardware**

1. **[Compute](compute)** - accelerators, CPUs, CPU memory.

1. **[Storage](storage)** - local, distributed and shared file systems.

1. **[Network](network)** - intra- and inter-node networking.


**Part 3. Orchestration**

1. **[Orchestration Systems](orchestration)** - managing containers and resources
1. **[SLURM](orchestration/slurm)** - Simple Linux Utility for Resource Management


**Part 4. Training**

1. **[Training](training)** - model training-related guides


**Part 5. Inference**

1. **[Inference](inference)** - model inference insights


**Part 6. Development**

1. **[Debugging and Troubleshooting](debug)** - how to debug easy and difficult issues

1. **[And more debugging](https://github.com/stas00/the-art-of-debugging)**

1. **[Testing](testing)** - numerous tips and tools to make test writing enjoyable


**Part 7. Miscellaneous**

1. **[Resources](resources)** - LLM/VLM chronicles


## Updates

I announce any significant updates on my Twitter channel [https://twitter.com/StasBekman](https://twitter.com/StasBekman).

## Ebook versions of the book

You can download various ebook formats of this book:
* [PDF](https://huggingface.co/stas/ml-engineering-book/resolve/main/Stas%20Bekman%20-%20Machine%20Learning%20Engineering.pdf?download=true)
* [EPUB](https://huggingface.co/stas/ml-engineering-book/resolve/main/Stas%20Bekman%20-%20Machine%20Learning%20Engineering.epub?download=true)


I will try to rebuild these every few weeks or so, but if you want the latest ebook versions, the instructions for building are [here](build).

Thanks to HuggingFace for giving me permission to host my book's ebook formats at the [HF hub](https://huggingface.co/stas/ml-engineering-book).

## The Art of Debugging

Make sure to also read [The Art of Debugging Open book](https://github.com/stas00/the-art-of-debugging), which has a lot of related recipes and methodologies.

## Lectures/Talks

- [Building resilient ML Engineering skills](https://www.youtube.com/watch?v=IBJUt9JPKHk) given on 2026-01-10 for the [GPU Mode community](https://github.com/gpu-mode). I only had time to discuss the performance reality of accelerators, network and storage and how each of them can be crucial to the ensemble's performance. Thanks to [Mark Saroufim](https://github.com/msaroufim) for organizing and providing awesome support during the talk.

## Discussions

If you want to discuss something related to ML engineering this repo has the [community discussions](https://github.com/stas00/ml-engineering/discussions) available - so please don't hesitate to share your experience or start a new discussion about something you're passionate about.

## Key comparison tables

High end accelerators:

- [Theoretical accelerator TFLOPS](compute/accelerator#tflops-comparison-table)
- [Accelerator memory size and speed](compute/accelerator#accelerator-memory-size-and-speed)

Networks:

- [Theoretical inter-node speed](network#inter-node-networking)
- [Theoretical intra-node speed](network#intra-node-networking)

## Shortcuts

Things that you are likely to need to find quickly and often.

Tools:

- [all_reduce_bench.py](network/benchmarks/all_reduce_bench.py) - a much easier way to benchmark network throughput than nccl-tests.
- [torch-distributed-gpu-test.py](debug/torch-distributed-gpu-test.py) - a tool to quickly test your inter-node connectivity
- [mamf-finder.py](compute/accelerator/benchmarks/mamf-finder.py) - what is the actual TFLOPS measurement you can get from your accelerator.

Guides:

- [debugging pytorch applications](debug/pytorch.md) - quick copy-n-paste solutions to resolve hanging or breaking pytorch applications
- [slurm for users](orchestration/slurm/users.md) - a slurm cheatsheet and tricks
- [make tiny models/datasets/tokenizers](debug/make-tiny-models-tokenizers-datasets.md)
- [LLM/VLM chronicles collection](resources#publicly-available-training-llmvlm-logbooks)


## Gratitude

None of this would have been possible without me being entrusted with doing the specific LLM/VLM trainings I have learned the initial know-how from. This is a privilege that only a few enjoy due to the prohibitively expensive cost of renting huge ML compute clusters. So hopefully the rest of the ML community will vicariously learn from these notes.

Special thanks go to [Thom Wolf](https://github.com/thomwolf) who proposed that I lead the BLOOM-176B training back when I didn't know anything about large scale training. This was the project that catapulted me into the intense learning process. And, of course, HuggingFace for giving me the opportunity to work full time on BLOOM-176B and later on IDEFICS-80B trainings.

Recently, I continued expanding my knowledge and experience while training models and building scalable training/inference systems at [Contextual.AI](https://contextual.ai/) and I'm grateful for that opportunity to Aman and Douwe.

I'd also like to thank the numerous [contributors](contributors.md) who have been making this text awesome and error-free.

## Contributing

If you found a bug, typo or would like to propose an improvement please don't hesitate to open an [Issue](https://github.com/stas00/ml-engineering/issues) or contribute a PR.


## License

The content of this site is distributed under [Attribution-ShareAlike 4.0 International](LICENSE-CC-BY-SA).


## Citation

```bibtex
@misc{bekman2024mlengineering,
  author = {Bekman, Stas},
  title = {Machine Learning Engineering Open Book},
  year = {2023-2026},
  publisher = {Stasosphere Online Inc.},
  journal = {GitHub repository},
  url = {https://github.com/stas00/ml-engineering}
}
```

## My repositories map

✔ **Machine Learning:**
 [ML Engineering Open Book](https://github.com/stas00/ml-engineering) |
 [ML ways](https://github.com/stas00/ml-ways) |
 [Porting](https://github.com/stas00/porting)

✔ **Guides:**
 [The Art of Debugging](https://github.com/stas00/the-art-of-debugging)

✔ **Applications:**
 [ipyexperiments](https://github.com/stas00/ipyexperiments)

✔ **Tools and Cheatsheets:**
 [bash](https://github.com/stas00/bash-tools) |
 [conda](https://github.com/stas00/conda-tools) |
 [git](https://github.com/stas00/git-tools) |
 [jupyter-notebook](https://github.com/stas00/jupyter-notebook-tools) |
 [make](https://github.com/stas00/make-tools) |
 [python](https://github.com/stas00/python-tools) |
 [tensorboard](https://github.com/stas00/tensorboard-tools) |
 [unix](https://github.com/stas00/unix-tools)


================================================
FILE: build/README.md
================================================
# Book Building

Important: this is still a WIP - it mostly works, but stylesheets need some work to make the pdf really nice. Should be complete in a few weeks.

This document assumes you're working from the root of the repo.

## Installation requirements

1. Install python packages used during book build
```
pip install -r build/requirements.txt
```

2. Download the free version of [Prince XML](https://www.princexml.com/download/). It's used to build the pdf version of this book.


## Build html

```
make html
```

## Build pdf

```
make pdf
```

It will first build the html target and then will use it to build the pdf version.

## Build epub

```
make epub
```

It will first build the html target and then will use it to build the epub version.


## Check links and anchors

To validate that all local links and anchored links are valid run:
```
make check-links-local
```

To also check external links, run:
```
make check-links-all
```
Use the latter sparingly to avoid being banned for hammering servers.


## Move md files/dirs and adjust relative links


e.g. `slurm` => `orchestration/slurm`
```
src=slurm
dst=orchestration/slurm

mkdir -p orchestration
git mv $src $dst
perl -pi -e "s|$src|$dst|" chapters-md.txt
python build/mdbook/mv-links.py $src $dst
git checkout $dst
make check-links-local

```

## Resize images

When included images are too large, make them a bit smaller:

```
mogrify -format png -resize 1024x1024\> *png
```


================================================
FILE: build/linkcheckerrc
================================================
# rtfm https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html

[output]

[text]
colorwarning=blue

[AnchorCheck]

[filtering]
ignorewarnings=http-redirected,http-moved-permanent

[checking]
threads=20


================================================
FILE: build/mdbook/md-to-html.py
================================================
import argparse
import datetime
import re

from functools import partial
from markdown_it import MarkdownIt
from mdit_py_plugins.anchors import anchors_plugin
from pathlib import Path

from utils.github_md_utils import md_header_to_anchor, md_process_local_links, md_expand_links, md_convert_md_target_to_html
from utils.build_utils import get_markdown_files

mdit = (
    MarkdownIt('commonmark', {'breaks':True, 'html':True})
    .use(anchors_plugin, max_level=7, permalink=False, slug_func=md_header_to_anchor)
    .enable('table')
)

my_repo_url = "https://github.com/stas00/ml-engineering/blob/master"

def convert_markdown_to_html(markdown_path, args):
    md_content = markdown_path.read_text()

    cwd_rel_path = markdown_path.parent

    repo_url = my_repo_url if not args.local else ""
    md_content = md_process_local_links(md_content, md_expand_links, cwd_rel_path=cwd_rel_path, repo_url=repo_url)
    md_content = md_process_local_links(md_content, md_convert_md_target_to_html)

    #tokens = mdit.parse(md_content)
    html_content = mdit.render(md_content)
    # we don't want <br />, since github doesn't use it in its md presentation
    html_content = re.sub('<br />', '', html_content)

    html_file = markdown_path.with_suffix(".html")
    html_file.write_text(html_content)

def make_cover_page_file(cover_md_file, date):
    with open(cover_md_file, "w") as f:
        f.write(f"""
![](images/Machine-Learning-Engineering-book-cover.png)

## Machine Learning Engineering Open Book

This is an ebook version of [Machine Learning Engineering Open Book by Stas Bekman](https://github.com/stas00/ml-engineering/) generated on {date}.

As this book is constantly being updated, if you downloaded it as a pdf or an epub file and the date isn't recent, chances are that it's already outdated - make sure to check the latest version at [https://github.com/stas00/ml-engineering](https://github.com/stas00/ml-engineering/).
""")
    return Path(cover_md_file)

def write_html_index(html_chapters_file, markdown_files):
    html_chapters = [str(l.with_suffix(".html")) for l in markdown_files]
    html_chapters_file.write_text("\n".join(html_chapters))


if __name__ == "__main__":

    parser = argparse.ArgumentParser()
    parser.add_argument('--local',  action="store_true", help="all local files remain local")
    args = parser.parse_args()

    date = datetime.datetime.now().strftime("%Y-%m-%d")

    cover_md_file = "book-front.md"

    md_chapters_file = Path("chapters-md.txt")
    html_chapters_file = Path("chapters-html.txt")

    pdf_file = f"Stas Bekman - Machine Learning Engineering ({date}).pdf"

    markdown_files = [make_cover_page_file(cover_md_file, date)] + get_markdown_files(md_chapters_file)

    pdf_files = []
    for markdown_file in markdown_files:
        convert_markdown_to_html(markdown_file, args)

    write_html_index(html_chapters_file, markdown_files)


================================================
FILE: build/mdbook/mv-links.py
================================================
"""

when chapters are moved around this script rewrites local relative links

python build/mdbook/mv-links.py slurm orchestration/slurm

"""

import datetime
import re
import sys
from pathlib import Path

from utils.build_utils import get_markdown_files
from utils.github_md_utils import md_rename_relative_links, md_process_local_links


def rewrite_links(markdown_path, src, dst):
    md_content = markdown_path.read_text()

    cwd_rel_path = markdown_path.parent
    md_content = md_process_local_links(md_content, md_rename_relative_links, cwd_rel_path=cwd_rel_path, src=src, dst=dst)

    markdown_path.write_text(md_content)


if __name__ == "__main__":

    src, dst = sys.argv[1:3]

    print(f"Renaming {src} => {dst}")

    md_chapters_file = Path("chapters-md.txt")
    markdown_files = get_markdown_files(md_chapters_file)

    for markdown_file in markdown_files:
        rewrite_links(markdown_file, src=src, dst=dst)


================================================
FILE: build/mdbook/preprocess-html-for-epub.py
================================================
#!/usr/bin/env python3
"""
Preprocess HTML files for EPUB generation.
Makes all anchor IDs globally unique by prefixing with chapter identifier,
then updates all internal links to use the prefixed anchors.
"""

import re
import sys
from pathlib import Path

# Import from sibling module
sys.path.insert(0, str(Path(__file__).parent))


def file_to_prefix(filepath):
    """Generate a unique prefix from a file path."""
    # Convert path like 'network/benchmarks/README.html' to 'network-benchmarks-readme'
    path = Path(filepath)
    parts = list(path.parent.parts) + [path.stem.lower()]
    # Filter out empty parts and join with hyphen
    prefix = '-'.join(p for p in parts if p and p != '.')
    return prefix if prefix else 'root'


def prefix_anchors_in_file(html_file, prefix):
    """Add prefix to all id attributes in an HTML file."""
    path = Path(html_file)
    if not path.exists():
        return {}
    
    content = path.read_text(encoding='utf-8')
    original = content
    anchor_map = {}  # old_anchor -> new_anchor
    
    def replace_id(match):
        quote_start = match.group(1)  # 'id="' or "id='"
        old_id = match.group(2)
        quote_end = match.group(3)
        
        new_id = f"{prefix}--{old_id}"
        anchor_map[old_id] = new_id
        return f"{quote_start}{new_id}{quote_end}"
    
    content = re.sub(r'(id=["\'])([^"\']+)(["\'])', replace_id, content)
    
    if content != original:
        path.write_text(content, encoding='utf-8')
    
    return anchor_map


def update_links_in_file(html_file, file_prefix_map, file_anchor_maps):
    """Update all internal links to use prefixed anchors."""
    path = Path(html_file)
    if not path.exists():
        return
    
    content = path.read_text(encoding='utf-8')
    original = content
    current_dir = path.parent
    current_prefix = file_prefix_map.get(str(path), file_prefix_map.get(html_file, ''))
    
    # Build set of valid HTML targets
    html_file_set = set(file_prefix_map.keys())
    
    def convert_href(match):
        prefix = match.group(1)  # 'href="' or "href='"
        href = match.group(2)
        quote = match.group(3)
        
        # Skip external links and mailto
        if href.startswith(('http://', 'https://', 'mailto:')):
            return match.group(0)
        
        # Handle anchor-only links (within same file)
        if href.startswith('#'):
            old_anchor = href[1:]
            # Look up the prefixed anchor for this file
            anchor_map = file_anchor_maps.get(str(path), file_anchor_maps.get(html_file, {}))
            if old_anchor in anchor_map:
                return f'{prefix}#{anchor_map[old_anchor]}{quote}'
            return match.group(0)
        
        # Parse href into path and anchor
        if '#' in href:
            href_path, anchor = href.split('#', 1)
        else:
            href_path = href
            anchor = None
        
        # Skip non-HTML links
        if not href_path.endswith('.html'):
            return match.group(0)
        
        # Resolve relative path
        try:
            resolved = (current_dir / href_path).resolve()
            rel_path = str(resolved.relative_to(Path.cwd()))
        except (ValueError, OSError):
            return match.group(0)
        
        # Check if target is in our HTML file list
        if rel_path in html_file_set:
            target_prefix = file_prefix_map[rel_path]
            target_anchor_map = file_anchor_maps.get(rel_path, {})
            
            if anchor and anchor in target_anchor_map:
                # Link to specific anchor in target file
                return f'{prefix}#{target_anchor_map[anchor]}{quote}'
            else:
                # Link to file without anchor - find first heading anchor
                if target_anchor_map:
                    # Get the first anchor (usually the main heading)
                    # We'll use a naming convention: look for the file's main id
                    first_anchor = next(iter(target_anchor_map.values()), None)
                    if first_anchor:
                        return f'{prefix}#{first_anchor}{quote}'
        
        return match.group(0)
    
    # Match href="..." or href='...'
    content = re.sub(r'(href=["\'])([^"\']+)(["\'])', convert_href, content)
    
    if content != original:
        path.write_text(content, encoding='utf-8')
        print(f"Updated: {html_file}")


def main():
    chapters_file = Path("chapters-html.txt")
    if not chapters_file.exists():
        print("Error: chapters-html.txt not found")
        sys.exit(1)
    
    html_files = [line.strip() for line in chapters_file.read_text().split('\n') if line.strip()]
    
    if not html_files:
        print("Error: No HTML files found")
        sys.exit(1)
    
    print(f"Processing {len(html_files)} HTML files for EPUB...")
    
    # Step 1: Generate unique prefix for each file
    file_prefix_map = {}
    for html_file in html_files:
        prefix = file_to_prefix(html_file)
        file_prefix_map[html_file] = prefix
        file_prefix_map[str(Path(html_file))] = prefix
    
    # Step 2: Prefix all anchors in each file and build anchor maps
    file_anchor_maps = {}
    for html_file in html_files:
        prefix = file_prefix_map[html_file]
        anchor_map = prefix_anchors_in_file(html_file, prefix)
        file_anchor_maps[html_file] = anchor_map
        file_anchor_maps[str(Path(html_file))] = anchor_map
    
    # Step 3: Update all links to use prefixed anchors
    for html_file in html_files:
        update_links_in_file(html_file, file_prefix_map, file_anchor_maps)
    
    print("Done preprocessing for EPUB")


if __name__ == "__main__":
    main()


================================================
FILE: build/mdbook/utils/build_utils.py
================================================
from pathlib import Path

def get_markdown_files(md_chapters_file):
    return [Path(l) for l in md_chapters_file.read_text().splitlines() if len(l)>0]


================================================
FILE: build/mdbook/utils/github_md_utils.py
================================================
"""
The utils in this module replicate github logic, which means it may or may not work for other markdown
"""

import re
from pathlib import Path

# matches ("Markdown text", Link) in [Markdown text](Link)
re_md_link_2_parts = re.compile(r"""
^
\[
([^]]+)
\]
\(
([^)]+)
\)
$
""", re.VERBOSE)

# matches one or more '[Markdown text](Link)' patterns
re_md_link_full = re.compile(r"""
(
\[
[^]]+
\]
\(
[^)]+
\)
)
""", re.VERBOSE|re.MULTILINE)

img_exts = ["jpg", "jpeg", "png"]
re_link_images = re.compile("(" + "|".join(img_exts) + ")", re.VERBOSE|re.MULTILINE|re.IGNORECASE)

cwd_abs_path = Path.cwd()

def md_is_relative_link(link):
    # skip any protocol:/ based links - what remains should be relative local links - relative to
    # the root of the project or to any of the local pages
    if ":/" in link:
        return False
    return True

def md_process_local_links(para, callback, **kwargs):
    """
    parse the paragraph to detect local markdown links, process those through callback and put them
    back into the paragraph and return the result
    """
    return re.sub(re_md_link_full,
                  lambda x: callback(x.group(), **kwargs) if md_is_relative_link(x.group()) else x.group(),
                  para)


def md_link_break_up(text):
    """
    text = [Markdown text](Link.md)
    returns ("Markdown text", "Link.md", None)

    text = [Markdown text](Link.md#bar)
    returns ("Markdown text", "Link.md", "bar")

    text = [Markdown text](Link/#bar)
    returns ("Markdown text", "Link/", "bar")
    """
    match = re.findall(re_md_link_2_parts, text)
    if match:
        link_text, full_link = match[0]

        # split full_link into link and anchor parts
        link_parts = full_link.split("#")
        link = link_parts[0]
        anchor = link_parts[1] if len(link_parts)==2 else None
        return (link_text, link, anchor)
    else:
        raise ValueError(f"invalid md link markup: {text}")


def md_link_build(link_text, link, anchor=None):
    """
    returns [link_text](link)
    """
    full_link = link
    if anchor is not None:
        full_link += f"#{anchor}"

    return f"[{link_text}]({full_link})"


def resolve_rel_link(link, cwd_rel_path):
    """ resolves all sorts of ./, ../foobar and returns a link relative to the repo root
    this is useful if a repo url needs to be prepended

    XXX: it assumes the program is run from the root of the repo
    """
    link = (Path(cwd_rel_path) / Path(link)).resolve().relative_to(cwd_abs_path)
    return str(link)

def md_expand_links(text, cwd_rel_path, repo_url=""):
    """
    Perform link rewrites as follows:
    - return unmodified if the link:
       * is empty (same doc internal anchor)
       * ends in .md (well defined)
       * is remote - i.e. contains protocol ://
    - convert relative link shortcuts into full links, e.g. s#chapter/?#chapter/README.md#
    - if the local link is not for .md or images, it's not going to be in the pdf, so resolve it and point
      to its url at the repo

    """
    link_text, link, anchor = md_link_break_up(text)

    #print(link_text, link, anchor)

    # skip:
    # - empty links (i.e. just local anchor to the same doc)
    # - skip explicit .md links
    # - external links like https://...
    if len(link) == 0 or link.endswith(".md") or re.search(r'^\w+://', link):
        return text

    link = Path(link)
    try_link = link / "README.md"

    full_path = cwd_rel_path / try_link
    if full_path.exists():
        link = str(try_link)
    else:
        link = str(link)

        if repo_url != "":
            # leave the images local for pdf rendering, but for the rest of the file (scripts,
            # reports, etc.)
            # prepend the repo base url, while removing ./ relative prefix if any
            if not re.search(re_link_images, link):
                link = resolve_rel_link(link, cwd_rel_path)
                link = repo_url + "/" + link

    return md_link_build(link_text, link, anchor)


def md_rename_relative_links(text, cwd_rel_path, src, dst):
    """
    Perform link rewrites as follows:
    - if the link contains protocol :// do nothing
    XXX: complete me when finished

    """
    link_text, link, anchor = md_link_break_up(text)

    # skip:
    # - empty links (i.e. just local anchor to the same doc)
    # - external links like https://...
    if len(link) == 0 or re.search(r'^\w+://', link):
        return text

    print(link_text, link, anchor)
    print(cwd_rel_path, src, dst)

    print("INCOMING ", link)

    full_path = str(cwd_rel_path / link)
    print("FULL ORIG", full_path)

    if str(cwd_rel_path) == ".":
        # top-level
        new_path = re.sub(rf"^{src}", dst, full_path)
        print("TOP   NEW", new_path)
    else:
        # sub-dir - to ensure we rewrite with leading / only
        new_path = re.sub(rf"/{src}", f"/{dst}", full_path)
        print("SUB   NEW", new_path)

    prefix = rf"^{cwd_rel_path}/" if str(cwd_rel_path) != "." else ""

    # did it not get modified?
    if full_path == new_path:
        # do nothing if there was no rewrite
        return text
    else:
        # if it got modified then undo the prepending of cwd_rel_path
        print("SHORT NEW", new_path)
        new_path = re.sub(prefix, "", new_path)


    # strip the prefix second time if it was also part of the rename
    #new_path = re.sub(prefix, "", new_path)

    print("FINAL   ", new_path)

    link = new_path

    #return text



    return md_link_build(link_text, link, anchor)



def md_convert_md_target_to_html(text):
    """
    convert .md target to .html target

    - chapter/doc.md => chapter/doc.html
    """
    link_text, link, anchor = md_link_break_up(text)
    # raw string avoids the invalid "\." escape sequence warning in newer Python versions
    link = re.sub(r"\.md$", ".html", link)
    return md_link_build(link_text, link, anchor)


def md_header_to_anchor(text):
    """
    Convert "#" headers into anchors
    # This is title => this-is-title
    """
    orig_text = text
    # lowercase
    text = text.lower()
    # keep only a subset of chars
    text = re.sub(r"[^-_a-z0-9\s]", r"", text, flags=re.IGNORECASE)
    # spaces2dashes
    text = re.sub(r"\s", r"-", text, flags=re.IGNORECASE)
    # leading/trailing cleanup
    text = re.sub(r"(^-+|-+$)", r"", text, flags=re.IGNORECASE)

    return text

def md_header_to_md_link(text, link=''):
    """
    Convert "#" headers into an md link

    # This is title => [This is title](link#this-is-title)

    if `link` is not passed or it's "" it'll generate a local anchored link
    """
    anchor = md_header_to_anchor(text)
    return f"[{text}]({link}#{anchor})"

if __name__ == "__main__":

    # # run to test some of these utils
    # para = 'bb [Markdown text](foo.md#tar) aaa bb [Markdown text2](foo/#bar) aaa [Markdown text3](http://ex.com/foo/#bar)'
    # print(para)
    # para = md_process_local_links(para, md_expand_links, cwd_rel_path=".")
    # print(para)

    # para = 'bb [Part 1](../Part1/) [Part 1](../Part1) [Local](#local) ![image](image.png)'
    # print(para)
    # para = md_process_local_links(para, md_expand_links, cwd_rel_path=".")
    # print(para)


    para = 'bb [Markdown text](foo.md#tar) aaa bb [Markdown text2](foo/#bar) aaa [Markdown text3](../foo/bar)'
    print(para)
    para = md_process_local_links(para, md_rename_relative_links, cwd_rel_path=Path("."), src="foo", dst="tar")
    print(para)


================================================
FILE: build/prince_style.css
================================================
/*
  CSS style sheet for prince html2pdf system (http://www.princexml.com/)

  Here's an example of how to use the style sheet:

  prince --no-author-style -s prince_style.css http://en.wikipedia.org/wiki/Winter_war -o foo.pdf
*/

@import url(http://www.princexml.com/fonts/gentium/index.css);

/* set headers and footers */

@page {
  size: letter;
  margin: 2cm 2cm;
  font: 11pt/1.3 "Gentium", serif;

/*
  @top-right {
    content: string(title);
    font-style: italic;
  }
  @top-left {
    content: string(source);
    font-style: italic;
  }
*/
  @bottom-center {
    content: counter(page);
    vertical-align: top;
    padding-top: 1em;
  }

  /* prince-shrink-to-fit: auto; */
}

/* #siteSub { string-set: source content() } */

/* basic style settings*/

body {
  font: 10pt/1.3 "Gentium", serif;
  prince-linebreak-magic: auto;
  hyphens: none;
  text-align: justify;
}

ul, ol, dl { text-align: left; hyphens: manual; }

chapter {
  page-break-before: always;
  prince-bookmark-level: 1;
  prince-bookmark-label: attr(title);
}

h1 { page-break-before: always; }

h1, h2, h3, h4, h5, h6 {
  line-height: 1.2;
  padding: 0;
  margin: 0.7em 0 0.2em;
  font-weight: normal;
  text-align: left;
  page-break-after: avoid;
  clear: both;
}

title { prince-bookmark-level: 1 }
h1 { prince-bookmark-level: 1 }
h2 { prince-bookmark-level: 2 }
h3 { prince-bookmark-level: 3 }
h4 { prince-bookmark-level: 4 }
h5 { prince-bookmark-level: 5 }
h6 { prince-bookmark-level: 6 }

/* a { text-decoration: none; color: inherit; } */

p {
  padding: 4px 0;   /* top & bottom, right & left */
  margin: 0;
}

/* blockquote p { */
/*   font-size: 1em; */
/*   font-style: italic; */
/* } */

blockquote {
  background: #f9f9f9;
  border-left: 10px solid #ccc;
  margin: 1.5em 10px;
  padding: 0.5em 10px;
}
blockquote p {
  display: inline;
}

code {
  font-family: Consolas, Menlo, Monaco, Lucida Console, Liberation Mono, DejaVu Sans Mono, Bitstream Vera Sans Mono, Courier New, monospace, serif;
  font-size: 0.8em; /* seems to be similar in size to the non-monospace font */
  background: #f9f9f9;
}

pre {
  background: #f9f9f9;
  margin: 1.5em 10px;
  padding: 0.5em 10px;
  white-space: pre-wrap; /* wrap long code sections to fit the page */
  hyphens: none; /* do not hyphenate code sections */
}

ol, ul {
  margin-top: 4px;
  margin-bottom: 4px;
  margin-left: 2em;
}
ul {  list-style-type: disc }


/* put article heading on top of the page, spanning all columns */

h1 {
  string-set: title content();
  padding-bottom: 0.2em;
  border-bottom: thin solid black;
  margin-bottom: 1em;
}


div {
  max-width: 100%
}

/* images */

/* this is important to fit huge images */
img {
  max-width: 650px;
}

tr, td, th {
  margin: 0;
/*  padding: 0.1em 0.2em; */
  text-align: left;
  vertical-align: top
}

div.center, th[align="center"] { text-align: center }

/* tables */

table {
  width: auto;
  border-collapse: collapse;
  border-bottom: thin solid black;
  margin: 1em 1em 2em 1em;
}
table, table td, table th {
  border: solid black .1px;
  padding: 0.4em;
  text-align: left;
}

table th { background: #eee; font-weight: bold}

/* hr { display: none } */

sup { vertical-align: baseline }
sup { vertical-align: top }

/* fix ' characters */
body { prince-text-replace: "'" "\2019" }


================================================
FILE: build/requirements.txt
================================================
codespell
linkchecker
markdown-it-py
mdit-py-plugins
pandoc


================================================
FILE: chapters-md.txt
================================================
README.md

insights/ai-battlefield.md
insights/how-to-choose-cloud-provider.md

compute/README.md
compute/accelerator/README.md
compute/accelerator/benchmarks/README.md
compute/accelerator/nvidia/debug.md
compute/accelerator/amd/debug.md
compute/accelerator/amd/performance.md
compute/cpu/README.md
compute/cpu-memory/README.md

storage/README.md
storage/benchmarks/results/hope-2023-12-20-14-37-02-331702-summary.md

network/README.md
network/comms.md
network/debug/README.md
network/benchmarks/README.md
network/benchmarks/results/README.md
network/benchmarks/results/disable-nvlink.md

orchestration/README.md
orchestration/slurm/README.md
orchestration/slurm/admin.md
orchestration/slurm/users.md
orchestration/slurm/performance.md
orchestration/slurm/launchers/README.md
orchestration/kubernetes/README.md

training/README.md
training/model-parallelism/README.md
training/performance/README.md
training/fault-tolerance/README.md
training/reproducibility/README.md
training/instabilities/README.md
training/instabilities/training-loss-patterns.md
training/checkpoints/README.md
training/hparams.md
training/dtype.md
training/emulate-multi-node.md
training/re-train-hub-models.md
training/datasets.md

inference/README.md

debug/README.md
debug/pytorch.md
debug/tools.md
debug/torch-distributed-hanging-solutions.md
debug/underflow_overflow.md
debug/make-tiny-models-tokenizers-datasets.md
debug/tiny-scripts/README.md

testing/README.md

resources/README.md

contributors.md

build/README.md


================================================
FILE: compute/README.md
================================================
# Compute

1. **[Accelerator](accelerator)** - the workhorses of ML - GPUs, TPUs, IPUs, FPGAs, HPUs, QPUs, RDUs (WIP)

1. **[CPU](cpu)** - cpus, affinities (WIP)

1. **[CPU Memory](cpu-memory)** - how much CPU memory is enough - the shortest chapter ever.


================================================
FILE: compute/accelerator/README.md
================================================
# Accelerators

Compute accelerators are the workhorses of ML training. In the beginning there were just GPUs. But now there are also TPUs, IPUs, FPGAs, HPUs, QPUs, RDUs, and more are being invented.

There exist two main ML workloads - training and inference. There is also the finetuning workload, which is usually the same as training, unless a much lighter [LoRA-style](https://arxiv.org/abs/2106.09685) finetuning is performed. The latter requires significantly fewer resources and time than normal finetuning.

In language models, generation during inference is sequential - one token at a time. So the same `forward` call has to be repeated thousands of times, one smallish `matmul` (matrix multiplication or GEMM) at a time. This can be done either on an accelerator, like a GPU, or on some of the most recent CPUs, which can handle inference quite efficiently.
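
To make the "one token at a time" point concrete, here is a toy sketch of a generation loop. `toy_forward`, `VOCAB_SIZE` and the prompt are hypothetical stand-ins for illustration only, not a real model; the sketch only shows that each generated token costs one `forward` call.

```python
VOCAB_SIZE = 100  # hypothetical toy vocabulary size


def toy_forward(tokens: list[int]) -> int:
    """Hypothetical stand-in for a model's `forward` call: returns the next token id."""
    return (31 * sum(tokens) + len(tokens)) % VOCAB_SIZE


def generate(prompt: list[int], max_new_tokens: int) -> tuple[list[int], int]:
    """Generate tokens sequentially and count how many `forward` calls were needed."""
    tokens = list(prompt)
    forward_calls = 0
    for _ in range(max_new_tokens):
        next_token = toy_forward(tokens)  # one `forward` call per generated token
        tokens.append(next_token)
        forward_calls += 1
    return tokens, forward_calls


tokens, forward_calls = generate(prompt=[1, 2, 3], max_new_tokens=5)
print(f"generated {len(tokens) - 3} tokens using {forward_calls} forward calls")
```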

During training the whole sequence length is processed in one huge `matmul` operation. So if the sequence is 4k tokens long, training the same model requires a compute unit that can handle 4k times more operations than inference and do it fast. Accelerators excel at this task. In fact, the larger the matrices they have to multiply, the more efficient the compute.
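
One rough way to see why larger matrices are more efficient is to compare how many floating point operations a `matmul` performs per byte of data it has to move. The sketch below is a back-of-the-envelope estimate under simplifying assumptions: each matrix is read or written exactly once, elements are 2 bytes wide (e.g. bf16), and the hidden size of 4096 is a made-up example value.

```python
def matmul_flops_per_byte(m: int, k: int, n: int, bytes_per_element: int = 2) -> float:
    """Estimate FLOPs per byte moved for an (m x k) @ (k x n) matmul.

    Assumes both inputs are read once and the output is written once.
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


hidden = 4096  # hypothetical hidden size

# generating a single token: a 1-row activation times a weight matrix
single_token = matmul_flops_per_byte(1, hidden, hidden)
# training on a 4k-token sequence: 4096 rows times the same weight matrix
full_sequence = matmul_flops_per_byte(4096, hidden, hidden)

print(f"1 token:   {single_token:.2f} FLOPs/byte")
print(f"4k tokens: {full_sequence:.2f} FLOPs/byte")
```

With these assumptions the single-token case does roughly one operation per byte moved, while the 4k-token case does over a thousand, which gives the compute units far more work for the same data traffic.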

The other computational difference is that while both training and inference have to perform the same total amount of `matmul`s in the `forward` pass, in the `backward` pass, which is only done for training, an additional 2x `matmul`s are done to calculate the gradients with respect to inputs and weights. And an additional `forward` is performed if activations recomputation is used. Therefore the training process requires 3-4x more `matmul`s than inference.
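
As a back-of-the-envelope sketch of the counting above, here is the same arithmetic in Python. The function name and the convention of counting one `forward` pass as 1 unit are illustrative assumptions, not something defined elsewhere in this book:

```python
def training_matmul_multiplier(activation_recomputation: bool = False) -> int:
    """Relative matmul cost of one training step, where one forward pass costs 1 unit."""
    forward = 1  # same matmuls as a single inference forward pass
    backward = 2  # gradients wrt inputs + gradients wrt weights
    recompute = 1 if activation_recomputation else 0  # the extra forward pass
    return forward + backward + recompute


print(training_matmul_multiplier(False))  # 3
print(training_matmul_multiplier(True))   # 4
```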

## Subsections

General:
- [Benchmarks](benchmarks)

NVIDIA:
- [Troubleshooting NVIDIA GPUs](nvidia/debug.md)

AMD:
- [Troubleshooting AMD GPUs](amd/debug.md)
- [AMD GPUs Performance](amd/performance.md)

## Bird's eye view on the high end accelerator reality

While this might be changing in the future, unlike the consumer GPU market, as of this writing there aren't that many high end accelerators, and if you rent on the cloud, most providers will have more or less the same few accelerators to offer.

GPUs:
- As of today, ML clouds/HPCs already have B200s/B300s. GB200/GB300 and Rubin are expected to emerge in H2-2026.
- AMD's MI325X is now widely available on Tier 2 cloud providers. MI355X is starting to emerge. MI400X hopefully in 2026. New: large CSPs started to offer AMD GPUs

HPU:
- Intel's Gaudi2 and Gaudi3 are available at Intel's cloud.
- Falcon Shores is to replace Gaudi in 2025 - update - the project has been cancelled
- Jaguar Shores is to replace Falcon Shores in 2026-2027

IPU:
- Graphcore with their IPU offering was briefly available at Paperspace, but it's gone now. I'm not sure if anybody offers those.

TPU:
- Google's TPUs are, of course, available but they aren't the most desirable accelerators because you can only rent them, and the software isn't quite easily convertible between GPUs and TPUs, and so many (most?) developers remain in the GPU land, since they don't want to be locked into a hardware which is a Google monopoly.
- Amazon's Trainium2 is very similar to the TPU architecture and is available on AWS

On Pods and racks:
- Cerebras' WaferScale Engine (WSE)
- SambaNova's DataScale
- dozens of different pod and rack configs that compose the aforementioned GPUs with super-fast interconnects.

That's about it as of Q5-2025.

The rest of this document will compare most of the above in detail and if you want to read the specs please head [here](#high-end-accelerators-for-ml-workloads).

As most of us rent the compute, and we never see what it looks like, here is what an 8xH100 node looks like physically (this is the GPU tray of the Dell PowerEdge XE9680 Rack Server):

![8x-H100-node-Dell-PowerEdge-XE9680](images/8x-H100-node-Dell-PowerEdge-XE9680.png)


## Glossary

- CPU: Central Processing Unit
- FMA: Fused Multiply Add
- FPGA: Field Programmable Gate Arrays
- GCD: Graphics Compute Die
- GPU: Graphics Processing Unit
- HBM: High Bandwidth Memory
- HPC: High-performance Computing
- HPU: Habana Gaudi AI Processor Unit
- IPU: Intelligence Processing Unit
- MAMF: Maximum Achievable Matmul FLOPS
- MME: Matrix Multiplication Engine
- QPU: Quantum Processing Unit
- RDU: Reconfigurable Dataflow Unit
- TBP: Total Board Power
- TDP: Thermal Design Power or Thermal Design Parameter
- TGP: Total Graphics Power
- TPU: Tensor Processing Unit

[Additional glossary @ Modal](https://modal.com/gpu-glossary)

## The most important thing to understand

I will make the following statement multiple times in this book - it's not enough to buy/rent the most expensive accelerators and expect a high return on investment (ROI).

The two metrics for a high ROI for ML training are:
1. the speed at which the training will finish, because if the training takes 2-3x longer than planned, your model could become irrelevant before it was released - time is everything in the current super-competitive ML market.
2. the total $$ spent to train the model, because if the training takes 2-3x longer than planned, you will end up spending 2-3x more.

Unless the rest of the purchased/rented hardware is chosen carefully to match the required workload, chances are very high that the accelerators will idle a lot and both time and $$ will be lost. The most critical component is [network](../../network), then [storage](../../storage/), and the least critical ones are [CPU](../cpu) and [CPU memory](../cpu-memory) (at least for a typical training workload where any CPU limitations are compensated with multiple `DataLoader` workers).

If the compute is rented one usually doesn't have the freedom to choose - the hardware is either set in stone or some components might be replaceable but with not too many choices. Thus there are times when the chosen cloud provider doesn't provide a sufficiently well matched hardware, in which case it's best to seek out a different provider.

If you purchase your servers then I recommend performing very in-depth due diligence before buying.

Besides hardware, you, of course, need software that can efficiently deploy the hardware.

We will discuss both the hardware and the software aspects in various chapters of this book. You may want to start [here](../../training/performance) and [here](../../training/model-parallelism).



## What Accelerator characteristics do we care for

Let's use the NVIDIA A100 spec as a reference point in the following sections.

![nvidia-a100-spec](images/nvidia-a100-spec.png)

[source](https://www.nvidia.com/en-us/data-center/a100/)

### TFLOPS

As mentioned earlier most of the work that ML training and inference do is matrix multiplication. If you remember your algebra, matrix multiplication is made of many multiplications followed by summation. Each of these computations can be counted, and together they define how many of these operations can be performed by the chip in a single second.

This is one of the key characteristics that the accelerators are judged by. The term TFLOPS defines how many trillions of FloatingPointOperations the chip can perform in a second. The more the better. There is a different definition for different data types. For example, here are a few entries from the theoretical peak TFLOPS from [A100 spec](https://www.nvidia.com/en-us/data-center/a100/):

| Data type \ TFLOPS     | w/o Sparsity | w/ Sparsity |
| :--------------------  | -----------: | ----------: |
| FP32                   |         19.5 |         n/a |
| Tensor Float 32 (TF32) |          156 |         312 |
| BFLOAT16 Tensor Core   |          312 |         624 |
| FP16 Tensor Core       |          312 |         624 |
| INT8 Tensor Core       |          624 |        1248 |

Notes:

* INT8 is measured in TeraOperations as it's not a floating point operation.

* the term FLOPS could mean either the total number of FloatingPointOperations, e.g. when counting how many FLOPS a single Transformer iteration takes, and it could also mean FloatingPointOperations per second - so watch out for the context. When you read an accelerator spec it's almost always a per second definition. When model architectures are discussed it's usually just the total number of FloatingPointOperations.

So you can see that int8 is 2x faster than bf16 which in turn is 2x faster than tf32.

Moreover, the TFLOPS depend on the matrix size, as can be seen from this graph:

![nvidia-a100-matmul-tflops](images/nvidia-a100-matmul-tflops.png)

[source](https://developer.nvidia.com/blog/cuda-11-features-revealed/)

As you can see the difference in performance is non-linear due to [the tile and wave quantization effects](../../training/performance#tile-and-wave-quantization). Note the blue line in the graph corresponds to FP32 Tensor Core.

#### How To Calculate Theoretical TFLOPS

Theoretical peak FLOPS is what gets published on the accelerator's spec. And it's calculated as:

`Theoretical FLOPS = compute_unit_clock_speed * FLOPs_per_clock_cycle_per_compute_unit * num_compute_units`

where:
- `compute_unit_clock_speed` - how many times the compute unit clock ticks per second in Hz
- `FLOPs_per_clock_cycle_per_compute_unit` - the number of floating point operations the compute unit can execute per clock cycle.
- `num_compute_units` - how many units there are in the device

FLOPs per clock cycle per compute unit is usually not published, but what one often finds is the FMAs per clock cycle per compute unit specs. FMA is Fused Multiply Add. And since 1 FMA is made of 2 FLOPs, we can expand the above formula to:

`Theoretical FLOPS = compute_unit_clock_speed * FMAs_per_clock_cycle_per_compute_unit * 2 * num_compute_units`

Let's validate that this formula checks out. Let's compute some BF16 (half precision) TFLOPS and compare to the published specs.

First, let's extract the necessary accelerator specs from [wiki](https://en.wikipedia.org/wiki/Hopper_(microarchitecture)#H100_accelerator_and_DGX_H100).

The tricky part was to find the FMAs ops per Tensor Core per clock cycle for BF16 (half precision). I found them [here](https://forums.developer.nvidia.com/t/how-to-calculate-the-tensor-core-fp16-performance-of-h100/244727/2). Most are coming from the [A100 whitepaper](https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf) (search the pdf for "FMA" and then choose the ones listed for the target precision you're after). The [H100 whitepaper](https://resources.nvidia.com/en-us-tensor-core) omitted a lot of specific FMA numbers, but included the multipliers wrt FMAs listed in the A100 whitepaper.


**For NVIDIA @ BF16**:

For NVIDIA, BF16 operations are performed by Tensor Cores.

| Accelerator | Boost Clock | FMAs ops per Tensor Core per clock cycle | Tensor Cores | Spec TFLOPS |
| :---------  | ---------:  | ---------------------------------------: | -----------: | ----------: |
| H100 SXM    | 1980MHz     |                                      512 |          528 |         989 |
| A100 SXM    | 1410MHz     |                                      256 |          432 |         312 |

Now let's do the math, by inserting the numbers from the table above into the last FMA-based formula:

- `1980*10**6 * 512 * 2 * 528 / 10**12 = 1070.530` TFLOPS
- `1410*10**6 * 256 * 2 * 432 / 10**12 = 311.87` TFLOPS

The calculated A100 SXM TFLOPS number matches the published 312 TFLOPS, but H100 SXM is slightly off (some 80 points higher than the spec) - most likely when its theoretical specs were calculated a lower boost clock speed was used. We can reverse engineer what it was using the spec TFLOPS: `989 / (512 * 2 * 528 / 10**12) / 10**6 = 1829.20`. Indeed some Internet articles publish 1830MHz as the actual boost clock speed of H100 SXM.
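
The same arithmetic can be wrapped into a small Python sketch, which makes it easy to plug in the numbers of another accelerator. The function names here are made up for illustration, and the inputs are the ones from the NVIDIA table above:

```python
def theoretical_tflops(clock_mhz: float, fmas_per_clock_per_unit: int, num_units: int) -> float:
    """Theoretical peak TFLOPS using the FMA-based formula (1 FMA = 2 FLOPs)."""
    return clock_mhz * 10**6 * fmas_per_clock_per_unit * 2 * num_units / 10**12


def implied_clock_mhz(spec_tflops: float, fmas_per_clock_per_unit: int, num_units: int) -> float:
    """Invert the formula: which clock speed would produce the published TFLOPS?"""
    return spec_tflops * 10**12 / (fmas_per_clock_per_unit * 2 * num_units) / 10**6


a100 = theoretical_tflops(1410, 256, 432)
h100 = theoretical_tflops(1980, 512, 528)
h100_clock = implied_clock_mhz(989, 512, 528)

print(f"A100 SXM: {a100:.2f} TFLOPS")                     # 311.87
print(f"H100 SXM: {h100:.2f} TFLOPS")                     # 1070.53
print(f"H100 SXM implied boost clock: {h100_clock:.1f} MHz")  # 1829.2
```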

**For AMD @ BF16**:

| Accelerator | Boost Clock | FMAs ops per Matrix Core per clock cycle | Matrix Cores | Spec TFLOPS |
| :---------  | ---------:  |   -------------------------------------: |  ----------: | ----------: |
| MI300X      | 2100MHz     |                                      256 |         1216 |        1307 |

Let's calculate ourselves as before:

- `2100*10**6 * 256 * 2 * 1216 / 10**12 = 1307.4` TFLOPS - matches the published spec, even though most of the time you will see the rounded down `1300` TFLOPS in the literature.

**For Intel @ BF16**:

Intel Gaudi uses MMEs to do BF16 `matmul`.

| Accelerator | Boost Clock | FMAs ops per MME per clock cycle | MMEs  | Spec TFLOPS |
| :---------  | ---------:  | -------------------------------: | ----: | ----------: |
| Gaudi 2     | 1650MHz     | 256*256                          | 2     | 432         |
| Gaudi 3     | 1600MHz     | 256*256                          | 8     | 1677        |
|             |             |                                  |       |             |

Let's calculate ourselves as before:

- Gaudi 2: `1650*10**6 * 256*256 * 2 * 2 / 10**12 = 432.5` TFLOPS - matches the published spec
- Gaudi 3: `1600*10**6 * 256*256 * 2 * 8 / 10**12 = 1677` TFLOPS - note that this doesn't match the published spec in the whitepaper (1835 TFLOPS), because in order to have 1835 TFLOPS the clock has to be 1750MHz, i.e. the current incarnation of Gaudi3 is running at 1600MHz.

It should become obvious now that if your accelerator runs at a lower boost clock than the spec (e.g. overheating that leads to accelerator throttling) the expected TFLOPS will be lower than advertised.
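
Since the formula is linear in the clock speed, a quick sketch like the following can estimate the theoretical peak at a different clock. It assumes that only the clock changes and everything else stays the same, and the function name is hypothetical. The example reuses the Gaudi 3 numbers from above:

```python
def tflops_at_clock(spec_tflops: float, spec_clock_mhz: float, actual_clock_mhz: float) -> float:
    """Scale the theoretical peak TFLOPS linearly with the clock speed."""
    return spec_tflops * actual_clock_mhz / spec_clock_mhz


# Gaudi 3: the whitepaper spec is 1835 TFLOPS at 1750MHz, but it runs at 1600MHz
gaudi3 = tflops_at_clock(1835, 1750, 1600)
print(f"{gaudi3:.1f} TFLOPS")  # 1677.7
```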

To check the actual clock speed when your accelerator is under load see the [clock speed](#clock-speed) section.




#### TFLOPS comparison table

Let's look at the supported [dtypes](../../training/dtype.md) and the corresponding theoretical peak TFLOPS specs across the high end accelerators (w/o sparsity). Sorted by the bf16 column.

| Accelerator \ TFLOPS  | fp32  | tf32   | fp16 | bf16 |  fp8  | int8 |  fp6  | fp4    | nvfp4   | Notes |
| :-------------------- | ----: | -----: | ---: | ---: | ----: | ---: | ----: | -----: | ------: | ----: |
| AMD MI450X            |       |        |      |      | 20000 |      |       | 40000  |         |       |
| NVIDIA Rubin SXM      | 130.0 | 2000.0 | 4000 | 4000 | 17500 | 2500 | 17500 | 35000  | 35000   |       |
| NVIDIA GB300 SXM      | 80.0  | 1250.0 | 2500 | 2500 | 5000  | 5000 | 5000  | 15000  | ?       | 10    |
| NVIDIA GB200 SXM      | 80.0  | 1250.0 | 2500 | 2500 | 5000  | 5000 | 5000  | 10000  | ?       | 2     |
| Google TPU v7x        | ?     | ?      | 2307 | 2307 | 4614  | ?    | ?     | ?      | ?       |       |
| AMD MI355X            | 157.3 | ?      | 2300 | 2300 | 4600  | 4600 | 9200  | 9200   | X       |       |
| NVIDIA B300 SXM       | 80.0  | 1125.0 | 2250 | 2250 | 4500  | 4500 | 4500  | 12600  | 15000   |       |
| NVIDIA B200 SXM       | 80.0  | 1125.0 | 2250 | 2250 | 4500  | 4500 | 4500  | 9000   | 10000   |       |
| Intel Gaudi3          | 229.0 | 459.0  | 459  | 1677 | 1677  | V    | X     | X      | X       | 1,8   |
| AMD MI325X            | 163.4 | 653.7  | 1300 | 1300 | 2600  | 2600 | X     | X      | X       | 7     |
| AMD MI300X            | 163.4 | 653.7  | 1300 | 1300 | 2600  | 2600 | X     | X      | X       |       |
| NVIDIA H200 SXM       | 67.0  | 494.5  | 989  | 989  | 1979  | 1979 | X     | X      | X       | 4     |
| NVIDIA H100 SXM       | 67.0  | 494.5  | 989  | 989  | 1979  | 1979 | X     | X      | X       | 3     |
| NVIDIA GH200 SXM      | 67.0  | 494.5  | 989  | 989  | 1979  | 1979 | X     | X      | X       | 6     |
| Google TPU v6e        |  ?    |  ?     |  918 | 918  | 918   | 1836 | X     | X      | X       |       |
| NVIDIA H100 PCIe      | 51.0  | 378.0  | 756  | 756  | 1513  | 1513 | X     | X      | X       |       |
| AWS Trainium2 / Ultra | 181.0 | 667.0  | 667  | 667  | 1299  | X    | X     | X      | X       | 9     |
| Google TPU v5p        | X     | X      | X    | 459  | X     | 918  | X     | X      | X       |       |
| Intel Gaudi2          | V     | V      | V    | 432  | 865   | V    | X     | X      | X       | 1     |
| AMD MI250X            | 47.9  | X      | 383  | 383  | X     | 383  | X     | X      | X       |       |
| NVIDIA L40S           | 91.6  | 183.0  | 362  | 362  | 733   | 733  | X     | X      | X       |       |
| AMD MI250             | 45.3  | X      | 362  | 362  | X     | 362  | X     | X      | X       |       |
| NVIDIA A100 SXM       | 19.5  | 156.0  | 312  | 312  | X     | 624  | X     | X      | X       |       |
| NVIDIA A100 PCIe      | 19.5  | 156.0  | 312  | 312  | X     | 624  | X     | X      | X       | 5     |
| Google TPU v4         | X     | X      | X    | 275  | X     | 275  | X     | X      | X       |       |
| Google TPU v5e        | X     | X      | X    | 197  | X     | 394  | X     | X      | X       |       |
|                       |       |        |      |      |       |      |       |        |         |       |

Row-specific notes:

1. Intel Gaudi2 and 3 only have partial TFLOPS [specs](https://www.intel.com/content/www/us/en/content-details/817486/intel-gaudi-3-ai-accelerator-white-paper.html) published, but they do support FP32, TF32, BF16, FP16 & FP8, INT8 and INT16. These numbers are for MME (Matrix) compute.

2. Since GB200 is 2x B200 chips the table includes TFLOPS per chip for a fair comparison - you'd 2x it for the real GB200 - it also seems to run the B200 chips a bit faster so higher specs than standalone B200. This also means that instead of your typical 8-GPU node, with GB200 you will get a 4-GPU node instead (but it'd be the equivalent of 8x B200 w/ an additional ~10% faster compute).

3. I didn't include `NVIDIA H100 dual NVL` as it's, well, 2x GPUs - so it won't be fair - it's the same FLOPS as H100 but 2x everything, plus it has a bit more memory (94GB per chip, as compared to 80GB H100) and the memory is a bit faster.

4. H200 is the same as H100 but has 141GB vs 80GB of HBM memory, and its memory is faster, HBM3e@4.8TBps vs HBM3@3.35TBps - so basically H200 solves the compute efficiency issues of H100.

5. Oddly, NVIDIA A100 PCIe and SXM revisions [spec](https://www.nvidia.com/en-us/data-center/a100/) are reported to have the same TFLOPS, even though the SXM version uses 30% more power and a 5% faster HBM.

6. GH200 - same note as GB200 - this is 2 GPUs in one package, so the table includes specs per chip w/o sparsity.

7. MI325X is the same compute as MI300X, but has more memory and more power (more efficient compute).

8. Gaudi3 as of this writing is running at 1600MHz (MME) and not the planned 1750MHz, therefore its BF16 TFLOPS are 1677 and not 1835 as per whitepaper spec. Same goes for fp8 which runs at the same TFLOPS as BF16.

9. Trainium2 also supports FP8/FP16/BF16/TF32 @ 2563 TFLOPS w/ 4:1 sparsity

10. GB200 NVL72 and GB300 NVL72 seem to be the same, except that the latter has faster fp4 and more memory.

General notes:

* int8 is measured in TeraOperations as it's not a floating point operation.

* if you find numbers that are double of the above - it usually means with sparsity (which at the moment almost nobody can benefit from as our matrices are dense).

* when looking at specs be very careful about which numbers you're reading - many vendors often publish TFLOPS with sparsity, as they are ~2x bigger, but if they even indicate this they often do it in small print. I had to ask NVIDIA to add a note to their H100 spec that those numbers were w/ sparsity as they originally didn't mention this important technical fact. And 99% of the time as of this writing you will not be using sparsity and thus the actual theoretical TFLOPS that you care for most of the time are w/o sparsity (i.e. the table above).

* also beware that if accelerator A publishes a higher TFLOPS than accelerator B, it doesn't mean A is faster. Not only can these theoretical numbers never be achieved in practice, but the actual TFLOPS efficiency, i.e. [Hardware FLOPS Utilization](../../training/performance#mfu-vs-hfu) (HFU), can vary a lot from vendor to vendor or even between the same vendor's different accelerator architectures.



#### Maximum Achievable FLOPS

The problem with the advertised theoretical peak FLOPS is that they are **very** theoretical and can't be achieved in practice even if all the perfect conditions have been provided. Each accelerator has its own realistic FLOPS which is not advertised and there are anecdotal community reports that do their best to find the actual best value, but I'm yet to find any official reports.

If you find solid reports (papers?) showing the actual TFLOPS one can expect from one or more of the high end accelerators discussed in this chapter please kindly submit a PR with this information. The key is to have a reference to a source that the reader can validate the proposed information with.

To provide a numerical sense to what I'm talking about let's take an A100 with its 312 TFLOPS bf16 peak performance in the specs of this card. Until the invention of FlashAttention it was known that 150 TFLOPS was close to the highest one could get for the fp16/bf16 mixed precision training regime. And with FlashAttention it's around 180+ TFLOPS. This is, of course, measured for training LLMs where the network and IO are involved, which create additional overheads. So here the maximum achievable peak performance probably lies somewhere between 200 and 300 TFLOPS.

You could measure the actual achievable peak TFLOPS by doing a perfectly aligned max-size matrices `matmul` measured on a single accelerator. You can use [Maximum Achievable Matmul FLOPS Finder](benchmarks#maximum-achievable-matmul-flops-finder) to reproduce the results. But, of course, this will only tell you how well your given accelerator and its software stack do `matmul` - depending on the workload this might be all you need to know, or not.
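
To make the measurement concrete, here is a minimal sketch of how a timed `matmul` is converted into TFLOPS. It relies on the standard count of `2*M*N*K` floating point operations for multiplying an `MxK` matrix by a `KxN` matrix (one multiply and one add per inner-product term). The function names and the 4 ms timing are made up for illustration and this is not the code of `mamf-finder.py`:

```python
def matmul_flops(m: int, n: int, k: int) -> int:
    """FLOPs of an (m x k) @ (k x n) matmul: m*n*k multiply-adds, 2 FLOPs each."""
    return 2 * m * n * k


def achieved_tflops(m: int, n: int, k: int, seconds: float) -> float:
    """Convert a measured matmul duration into TFLOPS."""
    return matmul_flops(m, n, k) / seconds / 10**12


# hypothetical measurement: an 8192x8192x8192 matmul that took 4 ms
result = achieved_tflops(8192, 8192, 8192, 0.004)
print(f"{result:.1f} TFLOPS")  # 274.9
```

In a real benchmark the duration would be averaged over many iterations after a warmup, with the accelerator synchronized before reading the timer.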

MAMF stands for [Maximum Achievable Matmul FLOPS](#maximum-achievable-matmul-flops-comparison-table), which is a term coined by yours truly. It is very practical for those who do performance optimization work.

#### Maximum Achievable Matmul FLOPS comparison table

The following measurements are for `matmul` with BF16 and FP8 inputs (no sparsity) TFLOPS (see above for what MAMF stands for). Reporting a mean of 100 iterations after 50 warmup iterations for each shape. Sorted by accelerator efficiency:

**BF16**:

| Accelerator      |   MAMF | Theory | Efficiency |  Best Shape MxNxK | torch version                  | Notes                              |
| :--------------- | -----: | -----: | ---------: | :---------------- | :----------------------------- | :--------------------------------- |
| Intel Gaudi 2    |  418.7 |    432 |      96.9% |  14336x15360x2048 | 2.6.0+hpu_1.21.2-76.gitabf798b | PT_HPU_LAZY_MODE=1                 |
| NVIDIA A100 SXM  |  271.2 |    312 |      86.9% |   1024x10240x5120 | 2.6.0+cu126                    |                                    |
| NVIDIA GH200 SXM |  828.6 |    989 |      83.8% |   1024x15360x4096 | 2.6.0+cu126                    | 900W 141GB HBM3e version           |
| NVIDIA A100 PCIe |  252.9 |    312 |      81.1% |    2048x5120x6144 | 2.5.1+cu124                    |                                    |
| NVIDIA H100 SXM  |  794.5 |    989 |      80.3% |   2048x2048x13312 | 2.7.0+cu126                    | H200 is the same                   |
| NVIDIA B300 SXM  | 1769.0 |   2250 |      78.6% |  12288x18432x1024 | 2.9.1+cu130                    | same as B200, newer torch/cuda     |
| NVIDIA B200 SXM  | 1745.0 |   2250 |      77.6% |   1792x16128x3072 | 2.7.1+cu128                    |                                    |
| Intel Gaudi 3    | 1243.0 |   1677 |      74.1% |    16384x4096x768 | 2.6.0+hpu_1.21.4-3.gitabf798b  | PT_HPU_LAZY_MODE=1                 |
| NVIDIA GB200 SXM | 1822.0 |   2500 |      72.9% |    4096x9728x2048 | 2.10.0.dev20250916+cu130       |                                    |
| AMD MI355X       | 1565.0 |   2300 |      68.0% |   12288x8192x8192 | 2.8.0+rocm7.0.2.git245bf6ed    | PYTORCH_TUNABLEOP_ENABLED=0        |
| AMD MI325X       |  784.9 |   1300 |      60.4% |  13312x10240x8192 | 2.6.0+6.2.4                    | PYTORCH_TUNABLEOP_ENABLED=1, 1000W |
| AMD MI300X       |  668.4 |   1300 |      51.4% |  10240x15360x8192 | 2.5.1+6.3.42131                | PYTORCH_TUNABLEOP_ENABLED=1        |
|                  |        |        |            |                   |                                |                                    |


**FP8 (`float8_e4m3fn`)**:

| Accelerator      |   MAMF | Theory | Efficiency |  Best Shape MxNxK | torch version                  | Notes                    |
| :--------------- | -----: | -----: | ---------: | :---------------- | :----------------------------- | :----------------------- |
| Intel Gaudi 2    |  826.5 |    865 |      95.5% |   6144x11264x5120 | 2.6.0+hpu_1.21.2-76.gitabf798b | PT_HPU_LAZY_MODE=1       |
| NVIDIA GH200 SXM | 1535.0 |   1979 |      77.6% |  1024x14336x14336 | 2.6.0+cu126                    | 900W 141GB HBM3e version |
| Intel Gaudi 3    | 1289.5 |   1677 |      76.9% |   16640x1536x3072 | 2.6.0+hpu_1.21.4-3.gitabf798b  | PT_HPU_LAZY_MODE=1       |
| NVIDIA B200 SXM  | 3432.5 |   4500 |      76.3% |   15360x4096x3072 | 2.7.1+cu128                    |                          |
| NVIDIA B300 SXM  | 3353.3 |   4500 |      74.5% |    3072x6144x7168 | 2.9.1+cu130                    |                          |
| NVIDIA H200 SXM  | 1453.4 |   1979 |      73.4% |   1280x4096x12032 | 2.7.1+cu128                    |                          |
| NVIDIA GB200 SXM | 3615.6 |   5000 |      72.3% |   19456x5120x1536 | 2.10.0.dev20250916+cu130       |                          |
| NVIDIA H100 SXM  | 1402.6 |   1979 |      70.9% |   1024x9216x14336 | 2.7.0+cu126                    |                          |
| AMD MI300X       |        |   2600 |            |                   |                                |                          |
|                  |        |        |            |                   |                                |                          |


Caveat emptor: these numbers were achieved by a brute-force search of a non-exhaustive sub-space of various shapes performing `matmul` (see [Maximum Achievable Matmul TFLOPS Finder](benchmarks#maximum-achievable-matmul-flops-finder)), using the software components available at the time of taking the measurement. So I highly recommend you re-run `mamf-finder.py` on your particular setup to get numbers that are true to your setup. The numbers in this table are a rough estimate and shouldn't be treated as absolute. As the software improves these numbers will improve, coming closer to the theoretical spec. So ideally they ought to be re-run every 6 months or so.

Notes:
- For the full set of theoretical ones see [Theoretical accelerator TFLOPS](#tflops-comparison-table)
- Efficiency is MAMF/Theory*100
- While `mean` is probably what most users are interested in, the script reports `max`, `median` and `mean` - should you want the other numbers.
- Best shape is the one detected by the script, but there could be many others with similar performance - it's listed for reproducibility
- If you get a much lower performance than the numbers in this table, check that the target hardware has adequate cooling - if the accelerator is overheated it'd usually throttle its performance down. And, of course, the assumption here is that the power supply matches the spec. The latter is rarely a problem in data centers, but bad cooling is not unheard of.
- Which software you use can make a huge difference - e.g., with MI300X I clocked 450TFLOPS using ROCm-6.1, but as you can see there was a dramatic improvement in ROCm-6.2 where it jumped a whopping additional 300 TFLOPS up. BLAS library type/version may have a big impact as well.
- Then there are various system optimizations - e.g. in the case of MI300X disabling numa_balancing in the kernel settings is a must.
- AMD MI250X has 2 GCDs - so the theoretical TFLOPS needs to be halved, as a single matmul uses only 1 of them and 383 TFLOPS is reported for 2 GCDs.
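
The efficiency column and the MI250X note can be reproduced with a few lines of Python. This is only a sketch: the helper name is made up, and the inputs are taken from the BF16 table and the notes above:

```python
def efficiency_pct(mamf_tflops: float, theory_tflops: float) -> float:
    """Efficiency as defined in the notes: MAMF/Theory*100."""
    return mamf_tflops / theory_tflops * 100


# (MAMF, Theory) pairs copied from the BF16 table
rows = {
    "NVIDIA A100 SXM": (271.2, 312),
    "AMD MI300X": (668.4, 1300),
}
for name, (mamf, theory) in rows.items():
    print(f"{name}: {efficiency_pct(mamf, theory):.1f}%")  # 86.9% and 51.4%

# MI250X: the 383 TFLOPS spec covers 2 GCDs, but a single matmul uses only 1
mi250x_per_gcd_theory = 383 / 2
print(f"MI250X theory per GCD: {mi250x_per_gcd_theory} TFLOPS")  # 191.5
```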

Also it's important to understand that knowing the Maximum Achievable Matmul TFLOPS at some particular shape like `4352x3840x13568` doesn't mean you can expect to get the same performance in your real application because chances are low that you will ever hit that exact shape. Instead, to know your system well, you'd run the [MAMF Finder](benchmarks#maximum-achievable-matmul-flops-finder) with the actual shapes your model is using during its training. This really is the main intention of this tool. You will have a good sense of when you can stop optimizing by comparing the TFLOPS reported by your training to Maximum Achievable MatMul TFLOPS you measured on your specific accelerator cluster.

And to conclude this section I'd like to repeat again that **the intention here is not to point fingers at which accelerator is less efficient than another, but to give a sense of what's what and how to navigate those theoretical specs and to help you understand when you need to continue optimizing your system and when to stop. So begin with these notes and numbers as a starting point, then measure your own use case and use that latter measurement to gain the best outcome.**

update: this new metric is starting to catch on. AMD published this graph and [explanations of why the efficiency of accelerators is going down as they get faster](https://rocm.blogs.amd.com/software-tools-optimization/Understanding_Peak_and_Max-Achievable_FLOPS/README.html):

![maf-nvidia-amd-efficiency.png](images/maf-nvidia-amd-efficiency.png)

[source](https://rocm.blogs.amd.com/software-tools-optimization/Understanding_Peak_and_Max-Achievable_FLOPS/README.html)


#### Not all accelerators are created equal

While measuring how well an accelerator performs, you need to be aware that while it gives you the ballpark performance numbers, other accelerators are likely to perform slightly differently. I have seen 5% and higher differences on an 8-gpu node.

This partially has to do with manufacturing processes and how well each accelerator is installed, and much more with how equally each accelerator is cooled. For example, when air cooling is used it's very likely that the accelerators closer to the source of cooling will perform better than those further away, especially since now the hot air dissipated from one row gets blown into the next row of accelerators. Things should be better with liquid cooling.

Therefore, you want to measure the performance of all accelerators on the node and do it at the same time. For example, on NVIDIA nodes, if each benchmark measures a single accelerator, you could do:
```
CUDA_VISIBLE_DEVICES=0 ./some-benchmark
CUDA_VISIBLE_DEVICES=1 ./some-benchmark
...
CUDA_VISIBLE_DEVICES=7 ./some-benchmark
```

Now here what you want is the slowest performance, because when used in an ensemble that slowest accelerator (straggler) will set the speed for all other accelerators.

If you do multi-node training then, of course, you'd want to measure them all.

So if you decide to calculate your achievable [MFU](../../training/performance#mfu-vs-hfu) (rather than theoretical one) you'd want to measure the achievable FLOPS across all participating accelerators and pick the value of the slowest accelerator. (If it really is an outlier you might want to consider replacing it as well).
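
Here is a minimal sketch of picking the slowest accelerator from a set of per-device measurements and flagging it if it looks like an outlier. The measurements below are invented for illustration, and the 5% threshold relative to the median is an arbitrary assumption rather than a recommendation:

```python
from statistics import median
from typing import Dict, Tuple


def slowest_accelerator(
    tflops_by_device: Dict[int, float], outlier_pct: float = 5.0
) -> Tuple[int, float, bool]:
    """Return (device id, TFLOPS, is_outlier) for the slowest accelerator."""
    device_id = min(tflops_by_device, key=tflops_by_device.get)
    slowest = tflops_by_device[device_id]
    med = median(tflops_by_device.values())
    # how far the slowest device is below the median, in percent
    gap_pct = (med - slowest) / med * 100
    return device_id, slowest, gap_pct > outlier_pct


# hypothetical TFLOPS measured concurrently on an 8-accelerator node
measured = {0: 790.1, 1: 785.4, 2: 792.0, 3: 741.3, 4: 788.7, 5: 786.2, 6: 791.5, 7: 789.0}
device_id, tflops, is_outlier = slowest_accelerator(measured)
print(f"slowest: device {device_id} @ {tflops} TFLOPS, outlier: {is_outlier}")
```

The returned TFLOPS value is the one to use as the achievable peak when calculating MFU for the whole ensemble.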




### Accelerator memory size and speed

The accelerators use [High Bandwidth Memory](https://en.wikipedia.org/wiki/High_Bandwidth_Memory) (HBM) which is a 3D version of SDRAM memory. For example, A100-SXM comes with HBM2 at 1.6TBps, and H100-SXM comes with HBM3 at 3.35TBps (see the full table per accelerator below).

Here are the specs:

| Type  | Max data<br> rate speed per<br> pin (Gbps) | Stack<br> Height | Bits per<br> Channel | Number<br> of dies<br> per stack | Die capacity<br> per stack<br> (GiBs) | Max capacity<br> per stack<br> (GiBs) | Max data<br> rate per<br> stack (GBps) |
| :---- | --: | -: | --: | -: | -: | -: | ---: |
| HBM1  | 1.0 |  8 | 128 |  4 |  1 |  4 |  128 |
| HBM2  | 2.4 |  8 | 128 |  8 |  1 |  8 |  307 |
| HBM2e | 3.6 |  8 | 128 | 12 |  2 | 24 |  461 |
| HBM3  | 6.4 | 16 |  64 | 12 |  2 | 24 |  819 |
| HBM3e | 9.8 | 16 |  64 | 16 |  3 | 48 | 1229 |
| HBM4  | 6.4 | 32 |  64 | 16 |  4 | 64 | 1638 |

Notes:

- While I was researching this table I found a wide variation of the above numbers. I think it's because either there were different implementations or the specs changed several times and different publications caught different specs. The table above comes from [wikipedia](https://en.wikipedia.org/wiki/High_Bandwidth_Memory).
- Since HBM is a stack of multiple DRAM chips, the *Stack Height* specifies how many chips are per device.

Beware that sometimes memory specs may not be very clear about what GB means. Sometimes it's GiB (`2**30`), but written as GB. At other times it's actually GB (`10**9`). For example, you will see references to NVIDIA B200 having 192GB, where they actually mean GB and not GiB, because 192GB is 180GiB. Whereas bandwidth (GBps, TBps) almost always means decimals (1GBps:`10**9`Bytes per sec). To convert GiB to GB: `x*2**30/10**9`. To convert from GB to GiB `x*10**9/2**30`. Most often the memory size will be in GiB (but `i` omitted), for bandwidth it'll be GB.
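
The two conversion formulas above translate directly into Python. The function names are made up for this sketch:

```python
def gib_to_gb(x: float) -> float:
    """Convert GiB (2**30 bytes) to GB (10**9 bytes)."""
    return x * 2**30 / 10**9


def gb_to_gib(x: float) -> float:
    """Convert GB (10**9 bytes) to GiB (2**30 bytes)."""
    return x * 10**9 / 2**30


print(f"80 GiB = {gib_to_gb(80):.1f} GB")   # 85.9
print(f"80 GB  = {gb_to_gib(80):.1f} GiB")  # 74.5
```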

Typically, the more on-device memory an accelerator has, the better its performance. At any given time usually most of the model weights aren't being used as they wait for their turn to be processed and thus large memory allows more of the model to be on the accelerator memory and immediately available for access and update. When there is not enough memory, sometimes the model has to be split across multiple accelerators, or offloaded to CPU and/or disk.

Here are the memory specs for the recent high end accelerators (some aren't GA yet), sorted by memory size, then bandwidth:

| Accelerator           | Memory<br> (GiBs) | Type  | Peak<br>Bandwidth<br> (TBps) |
| :-------------------  | ----------------: | :---- |         -------------------: |
| AMD MI450X            |               432 | HBM4  |                        19.60 |
| NVIDIA Rubin SXM      |               288 | HBM4  |                        22.00 |
| NVIDIA B300 SXM       |               288 | HBM3e |                         8.00 |
| AMD MI355X            |               288 | HBM3e |                         8.00 |
| AMD MI325X            |               256 | HBM3e |                         6.00 |
| AMD MI300X            |               192 | HBM3  |                         5.30 |
| NVIDIA GB200 SXM      |               185 | HBM3e |                         8.00 |
| NVIDIA B200 SXM       |               180 | HBM3e |                         8.00 |
| NVIDIA GH200 SXM (2)  |               141 | HBM3e |                         4.80 |
| NVIDIA H200 SXM       |               141 | HBM3e |                         4.80 |
| Intel Gaudi3          |               128 | HBM2e |                         3.70 |
| AMD MI250             |               128 | HBM2e |                         3.28 |
| AMD MI250X            |               128 | HBM2e |                         3.28 |
| NVIDIA GH200 SXM (1)  |                96 | HBM3  |                         4.00 |
| Intel Gaudi2          |                96 | HBM2e |                         2.46 |
| AWS Trainium2 / Ultra |                96 | HBM3  |                         2.90 |
| Google TPU v5p        |                95 | HBM2e |                         4.80 |
| NVIDIA H100 SXM       |                80 | HBM3  |                         3.35 |
| NVIDIA A100 SXM       |                80 | HBM2e |                         2.00 |
| NVIDIA H100 PCIe      |                80 | HBM3  |                         2.00 |
| NVIDIA A100 PCIe      |                80 | HBM2e |                         1.94 |
| NVIDIA L40S           |                48 | GDDR6 |                         0.86 |
| Google TPU v4         |                32 | HBM2  |                         1.20 |
| Google TPU v5e        |                16 | HBM2  |                         1.60 |


Notes:

* I didn't include `NVIDIA H100 dual NVL` as it's 2x H100 GPUs with an extra 14GB of memory per chip and slightly faster memory (3.9TBps vs 3.35TBps) - but it would have an unfair advantage in the above table as everything else is per-chip. (I guess AMD MI250 is also 2 GCDs, but they aren't very competitive anyway and will soon be displaced from this table by newer offerings.)

Memory speed (bandwidth) is, of course, very important since if it's not fast enough, the compute ends up idling waiting for the data to be moved to and from the memory.
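
To get a feel for what the bandwidth numbers mean, here is a rough back-of-the-envelope sketch that estimates the minimum time needed to read the whole accelerator memory once at peak bandwidth. It assumes the table's memory sizes are GiB and its bandwidth is decimal TBps, and it ignores that real workloads rarely reach peak bandwidth. The function name `min_read_time_ms` is hypothetical:

```python
def min_read_time_ms(size_gib: float, bandwidth_tbps: float) -> float:
    """Lower bound, in milliseconds, for reading `size_gib` GiB once
    at a peak bandwidth of `bandwidth_tbps` decimal TB per second."""
    size_bytes = size_gib * 2**30
    bytes_per_sec = bandwidth_tbps * 10**12
    return size_bytes / bytes_per_sec * 1e3


# (name, memory in GiB, peak bandwidth in TBps), taken from the table above
for name, gib, tbps in [("NVIDIA H100 SXM", 80, 3.35), ("NVIDIA A100 SXM", 80, 2.00)]:
    print(f"{name}: {min_read_time_ms(gib, tbps):.2f} ms to read {gib} GiB once")
```

With these inputs the H100 SXM comes out at roughly 25.6 ms and the A100 SXM at roughly 42.9 ms, so with the same amount of memory the compute units of the slower-memory card spend noticeably longer waiting on each full pass over the data.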


### Caches

High performance cache is used for storing frequently used instructions and data. L1 is usually the smallest and fastest, then L2 is a bit larger and a bit slower and there can be an L3 cache which is even bigger and slower. All these caches massively reduce trips to HBM.

The cache size often matters when running benchmarks, as one needs to reset the cache between experiments.

It's somewhat difficult to show a comparison of caches on different accelerators because they use different approaches.

Columns:

- The **L3** column is for optional additional caches: Some accelerators have only L1/L2 caches, yet others have additional caches - e.g. MI300X has 256MiB of AMD Infinity cache which they also call Last Level Cache (LLC), and Gaudi3 can have its L2 cache used as L3 cache.

- Units can be different things in different accelerators, e.g. in AMD those would be Accelerator Complex Dies (XCD) or compute dies, for NVIDIA this is usually the SMs, for Intel these are DCOREs (Deep Learning Core).

Sorting by L2 Total, as it seems to be the cache that is in all accelerators listed here.

| Accelerator          | L1/Unit | L2/Unit | Units | L1 Total | L2 Total | L3 Total | Notes |
| :------------------- | ------: | ------: | ----: | -------: | -------: | -------: | :---- |
| Intel Gaudi3         |         | 24MiB   |     4 |          | 96MiB    |          | 2,4   |
| NVIDIA GH100 SXM     | 256KiB  |         |   132 | 33.00MiB | 60MiB    |          |       |
| NVIDIA GH200 SXM     | 256KiB  |         |   132 | 33.00MiB | 60MiB    |          |       |
| NVIDIA H200 SXM      | 192KiB  |         |   132 | 24.75MiB | 50MiB    |          |       |
| NVIDIA H100 SXM      | 192KiB  |         |   132 | 24.75MiB | 50MiB    |          |       |
| Intel Gaudi2         |         |         |       |          | 48MiB    |          | 2,3   |
| NVIDIA A100 SXM      | 192KiB  |         |   108 | 20.25MiB | 40MiB    |          |       |
| NVIDIA A100 PCIe     | 192KiB  |         |   108 | 20.25MiB | 40MiB    |          |       |
| AMD MI355X           |  32KiB  |  4MiB   |     8 |  0.25MiB | 32MiB    | 256MiB   | 1     |
| AMD MI325X           |  32KiB  |  4MiB   |     8 |  0.25MiB | 32MiB    | 256MiB   | 1     |
| AMD MI300X           |  32KiB  |  4MiB   |     8 |  0.25MiB | 32MiB    | 256MiB   | 1     |
|                      |         |         |       |          |          |          |       |
| NVIDIA B200 SXM      | ???     |         |       |          |          |          |       |
| NVIDIA B300 SXM      | ???     |         |       |          |          |          |       |
|                      |         |         |       |          |          |          |       |

Notes:

1. AMD provides L3 AMD Infinity Cache which it also calls Last Level Cache (LLC) in the specs

2. Gaudi has a different architecture than a GPU. In Gaudi’s case, the MME and TPC have private buffers that perform some of the functions of an L1 cache, called Suspension Buffers. The main function that these buffers provide is data reuse from the buffer (instead of reading the same data multiple times from L2/L3/HBM). Both Gaudi2 and Gaudi3 have the same buffers for the TPC and MME.

3. Gaudi2 doesn’t have a cache. It has scratchpad SRAM instead of a cache, meaning that software determines what goes in or out of the SRAM at any moment. There are dedicated DMA engines that software needs to program to perform all the data movement between SRAM and HBM.

4. The 96MiB cache can be configured by software to be either a single L3 cache or 4 slices of 24MiB L2 cache (this is at tensor-level granularity). L2 configuration is 2x faster than L3.
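
The "Total" columns in the cache table are simply the per-unit size multiplied by the number of units. As a quick sanity-check sketch (the helper name `total_mib` is made up for illustration):

```python
def total_mib(per_unit_kib: float, units: int) -> float:
    """Total cache in MiB given the per-unit size in KiB and the unit count."""
    return per_unit_kib * units / 1024


print(total_mib(192, 132))      # NVIDIA H100 SXM L1 total -> 24.75
print(total_mib(32, 8))         # AMD MI300X L1 total      -> 0.25
print(total_mib(4 * 1024, 8))   # AMD MI300X L2 total      -> 32.0
```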



### Clock speed

Also known as [clock rate](https://en.wikipedia.org/wiki/Clock_rate) this spec tells us at which frequency the card runs. As hardware becomes faster newer generations will typically increase the clock speed.

When you read specs you're likely to see one or two specifications:

- Base clock is the minimum clock, used when the accelerator is idle
- Boost clock (or Peak Engine clock) is the guaranteed clock at heavy load - but it might be surpassed.

Often just the boost clock is specified.

These numbers are useful if you need to [calculate theoretical TFLOPS](#how-to-calculate-theoretical-tflops).
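
As a rough sketch of how the clock enters that calculation: theoretical FLOPS is the clock rate times the number of compute units times the operations each unit performs per clock. The example below uses the A100 SXM boost clock (1410MHz) and its 432 Tensor Cores. The value of 512 dense fp16 FLOPs per clock per Tensor Core is an assumption taken from NVIDIA's published A100 architecture figures, and the function name is hypothetical:

```python
def theoretical_tflops(clock_mhz: float, units: int, flops_per_clock_per_unit: int) -> float:
    """Theoretical peak in TFLOPS = clock * units * FLOPs per clock per unit."""
    return clock_mhz * 1e6 * units * flops_per_clock_per_unit / 1e12


# A100 SXM at its boost clock
# (512 FLOPs/clock/Tensor Core is an assumption, see the text above)
print(f"{theoretical_tflops(1410, 432, 512):.1f}")  # 311.9

# The same card if it only sustains 1200MHz under load (hypothetical value)
print(f"{theoretical_tflops(1200, 432, 512):.1f}")  # 265.4
```

The first result lines up with the 312 TFLOPS usually quoted for A100 dense fp16/bf16. The second shows why the actual clock under load matters: the peak scales linearly with it.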

I've observed that the same accelerator may have different clock rates published in different specs, probably because not all "final" versions are created equal. So always double check your specific accelerator for its actual specs.

Clock speed is in MHz.

| Accelerator          | Boost Clock | Notes              |
| :------------------- | ----------: | :----------------- |
| AMD MI355X           |        2400 |                    |
| AMD MI300X           |        2100 |                    |
| AMD MI325X           |        2100 |                    |
| NVIDIA GB200 SXM     |        2062 |                    |
| NVIDIA B200 SXM      |        1965 |                    |
| NVIDIA H200 SXM      |        1830 |                    |
| NVIDIA H100 SXM      |        1830 |                    |
| Intel Gaudi2         |        1650 | MME=1650, TPC=1800 |
| Intel Gaudi3         |        1600 | MME=1600, TPC=1600 |
| NVIDIA A100 SXM      |        1410 |                    |
| NVIDIA A100 PCIe     |        1410 |                    |
|                      |             |                    |
| NVIDIA B300 SXM      |           ? |                    |
|                      |             |                    |

Since now there are custom SKUs for the same type of GPU, always check your actual clock, since it could be different from the "general" spec.


Here is how to get the actual clock speed (in particular when your accelerator is under load):

- NVIDIA: `nvidia-settings -q GPUCurrentClockFreqs` with X-server, or `nvidia-smi -q -d CLOCK -i 0 | grep -B2 " SM " | head -3` for headless (adapt `-i 0` if not measuring gpu0, or remove it to show all available GPUs). Remove the `grep` to get the full output - `Max Clocks` shows the theoretical clock. Here is a continuous log of the same with timestamps: `nvidia-smi --query-gpu=index,timestamp,power.draw,clocks.sm,clocks.mem,clocks.gr --format=csv -l 1 -i 0`
- AMD: `rocm-smi -g` for actual and `amd-smi metric --clock` for theoretical
- Intel: `hl-smi -Q clocks.current.soc --format=csv` for actual and `hl-smi -Q clocks.max.soc --format=csv` or `hl-smi -Q clocks.limit.soc --format=csv` for theoretical



### Power consumption

There are three different definitions, whose only difference is which parts of the accelerator card are included in the measurement:

1. **Thermal Design Power (TDP)** is the maximum power that a subsystem is allowed to draw and also the maximum amount of heat an accelerator can generate. This measurement is just for the accelerator chip.
2. **Total Graphics Power (TGP)** is the same as TDP, but additionally includes the PCB's power, yet without cooling and LEDs (if any).
3. **Total Board Power (TBP)** is the same as TGP, but additionally includes cooling and LEDs (if any).

As high-end accelerators typically require external cooling and have no LEDs, TGP and TBP usually imply the same.

The actual power consumption in Watts will vary, depending on whether the accelerator is idle or used to compute something.

If you're a cloud compute user you normally don't care about these values because you're not paying for power consumption directly, as it's already included in your package. For those who host their own hardware these values are important because they tell you how much power and cooling you'd need to keep the hardware running without getting throttled or melting down.

These numbers are also important for knowing how close one can get to the published theoretical TFLOPS: the higher the TDP, the more efficient the compute will be. For example, while AMD MI325X has the same theoretical compute specs as its MI300X predecessor, the former is more efficient at effective compute because its TDP is 250W higher. In other words, given 2 accelerators with the same or very similar [theoretical compute specs](#tflops-comparison-table), the one with the higher TDP will be better at sustained compute.

Some specs report TDP, others TGP/TBP so the table has different columns depending on which measurement has been published. All measurements are in Watts:

| Accelerator           | TGP/TBP |   TDP | Notes |
| :-------------------  | ------: | ----: | :---- |
| AMD MI450X            | 2500    |       |       |
| NVIDIA Rubin SXM      |         |  2300 |       |
| NVIDIA GB300 SXM      |         |  1400 |       |
| AMD MI355X            | 1400    |       |       |
| NVIDIA B300 SXM       |         |  1300 |       |
| NVIDIA GB200 SXM      |         |  1200 |       |
| NVIDIA B200 SXM       |         |  1000 |       |
| AMD MI325X            | 1000    |       |       |
| Intel Gaudi3          |         |   900 |       |
| AMD MI300X            | 750     |       |       |
| NVIDIA H200 SXM       |         |   700 |       |
| NVIDIA H100 SXM       |         |   700 |       |
| Intel Gaudi2          |         |   600 |       |
| NVIDIA H200 NVL       |         |   600 |       |
| AMD MI250X            |         |   560 |       |
| AWS Trainium2 / Ultra |         |   500 |       |
| NVIDIA H100 NVL       |         |   400 |       |
| NVIDIA A100 SXM       |         |   400 | 1     |
| NVIDIA A100 PCIe      |         |   300 |       |
|                       |         |       |       |


1. HGX A100-80GB custom thermal solution (CTS) SKU can support TDPs up to 500W

Additional notes:

1. Google doesn't publish power consumption specs for recent TPUs, the older ones can be found [here](https://en.wikipedia.org/wiki/Tensor_Processing_Unit#Products)



### Cooling

This is of interest when you buy your own hardware. When you rent on the cloud the provider hopefully takes care of adequate cooling.

The only important practical understanding for cooling is that if the accelerators aren't kept cool they will throttle their compute clock and slow everything down and could even crash sometimes, albeit throttling is supposed to prevent that.

To check if your NVIDIA GPU gets throttled down, run `nvidia-smi -q -d PERFORMANCE` - if `SW Thermal Slowdown` or some other entries are `Active`, then you are not getting the full performance of your GPU and you need to investigate better cooling.
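
If you want to automate this check, here is a minimal sketch that scans the text report for entries marked `Active`. The `SAMPLE` text below is an illustrative excerpt rather than captured output, the exact section and entry names vary between driver versions, and the function name is made up for this example:

```python
def active_throttle_reasons(report: str) -> list[str]:
    """Return the names of all report entries whose value is exactly 'Active'."""
    reasons = []
    for line in report.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip() == "Active":
            reasons.append(name.strip())
    return reasons


# Illustrative excerpt only - real output has more entries and may use different names
SAMPLE = """\
    Clocks Throttle Reasons
        Idle                              : Not Active
        SW Power Cap                      : Not Active
        HW Slowdown                       : Not Active
        SW Thermal Slowdown               : Active
"""

print(active_throttle_reasons(SAMPLE))  # ['SW Thermal Slowdown']
```

On a real node you would feed it the output of `nvidia-smi -q -d PERFORMANCE` (for example captured via `subprocess.run(..., capture_output=True, text=True)`) and alert when the returned list is not empty.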



## High end accelerators for ML workloads

### Cloud accelerators

Most common accelerators that can be either rented on compute clouds or purchased:

NVIDIA:
- [Vera Rubin NVL72](https://www.nvidia.com/en-us/data-center/vera-rubin-nvl72/) - the 72 GPUs supernode at NVLink speed with Rubin GPUs (Vera Rubin - 36x blocks of 2x Rubin GPU + Vera CPU)
- [GB300 NVL72](https://www.nvidia.com/en-us/data-center/gb300-nvl72/) - the 72 GPUs supernode at NVLink speed with B300s (Grace Blackwell - 36x blocks of 2x B300 + Grace CPU)
- [GB200 NVL72](https://www.nvidia.com/en-us/data-center/gb200-nvl72/) - the 72 GPUs supernode at NVLink speed with B200s. (Grace Blackwell - 36x blocks of 2x B200 + Grace CPU)
- B300 - no official spec yet - only can be derived from the DGX spec: https://resources.nvidia.com/en-us-dgx-systems/dgx-b300-datasheet (XXX: update when official specs are released)
- B200 - no official spec yet - only can be derived from the DGX spec: https://resources.nvidia.com/en-us-dgx-systems/dgx-b200-datasheet (XXX: update when official specs are released)
- [H200](https://www.nvidia.com/en-us/data-center/h200/) - mainly the same as H100, but with more and faster memory! Supposed to become available some time mid-2024.
- [H100](https://www.nvidia.com/en-us/data-center/h100) - 2-3x faster than A100 (half precision), 6x faster for fp8, has been available on all Tier-1 compute clouds since Q4-2023.
- [GH200](https://www.nvidia.com/en-us/data-center/grace-hopper-superchip/) - 2 chips on one card - (1) H100 w/ 96GB HBM3 or 144GB HBM3e + (2) Grace CPU w/ 624GB RAM - first units have been reported to become available. Do not confuse with H200, which is a different card.
- [L40S](https://www.nvidia.com/en-us/data-center/l40s/) - a powerful card that is supposed to be more than 2x cheaper than H100, and it's more powerful than A100.
- [A100](https://www.nvidia.com/en-us/data-center/a100/#specifications) - huge availability, but already getting outdated. But given the much lower cost than H100 this is still a great GPU.

AMD:
- MI450 - no official spec yet
- [MI355X](https://www.amd.com/en/products/accelerators/instinct/mi350/mi355x.html) ~= B200 - just starting to emerge, mainly on Tier-2 clouds
- [MI350X](https://www.amd.com/en/products/accelerators/instinct/mi350/mi350x.html) ~= B200 - it seems that MI355X is made available instead of MI350X
- [MI325X](https://www.amd.com/en/products/accelerators/instinct/mi300/mi325x.html) ~= H200 - available mainly on Tier-2 clouds
- [MI300X](https://www.amd.com/en/products/accelerators/instinct/mi300/mi300x.html) ~= H100 - available mainly on Tier-2 clouds (lots of new startups)
- [MI250](https://www.amd.com/en/products/accelerators/instinct/mi200/mi250.html) ~= A100 - very few clouds have them


Intel:
- Jaguar Shores is expected some time in 2026/27
- The Falcon Shores line has been cancelled
- [Gaudi3](https://habana.ai/products/gaudi3/), somewhat below B200 theoretical TFLOPS-wise - already available on Intel cloud - [spec](https://www.intel.com/content/www/us/en/content-details/817486/intel-gaudi-3-ai-accelerator-white-paper.html)
- [Gaudi2](https://habana.ai/products/gaudi2/) somewhere between A100 and H100 theoretical TFLOPS-wise [spec](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html) - available on Intel cloud. AWS has the older Gaudi1 via [DL1 instances](https://aws.amazon.com/ec2/instance-types/dl1/). It's also available for on-premises implementations via Supermicro and WiWynn.

Amazon:
- [Trainium2](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-hardware/trainium2.html) < H100 - available on AWS (works via PyTorch XLA)


Graphcore:
- [IPU](https://www.graphcore.ai/products/ipu) - available via [Paperspace](https://www.paperspace.com/graphcore). The latest product MK2 (C600) has only 0.9GB SRAM per card, so it's not clear how this card can do anything ML-wise - even the weights of a small model won't fit for inference - but there is something new in the works at Graphcore, which I'm told we should discover soon. Here is a good explanation [of how IPU works](https://web.archive.org/web/20250521173833/https://thytu.com/posts/ipus-101/).

SambaNova:
- [DataScale SN30](https://sambanova.ai/products/datascale/)


### On-premises accelerator clusters

[Cerebras](https://www.cerebras.net/) cluster and systems based on WaferScale Engine (WSE).




### Cloud-only solutions

These can be only used via clouds:

Google [TPUs](https://cloud.google.com/tpu), [specs](https://cloud.google.com/tpu/docs/system-architecture-tpu-vm) - lock-in, can't switch to another vendor like NVIDIA -> AMD

Recent architecture specs:
- [v7x](https://docs.cloud.google.com/tpu/docs/tpu7x)
- [v6e](https://docs.cloud.google.com/tpu/docs/v6e)
- [v5p](https://docs.cloud.google.com/tpu/docs/v5p)

Cerebras:
- [Cloud](https://www.cerebras.net/product-cloud/)



### New hardware startups

These are possible future competitors to the big boys.

They typically target inference.

- [TensTorrent](https://tenstorrent.com), [n150s/n300s specs](https://docs.tenstorrent.com/aibs/wormhole/specifications.html)
- [d-Matrix](https://www.d-matrix.ai), [specs](https://www.d-matrix.ai/product/)


### How to get the best price

Remember that the advertised prices are almost always open to negotiations as long as you're willing to buy/rent in bulk or if renting for 1-3 years. What you will discover is that the actual price that you end up paying could be many times less than the advertised "public" price. Some cloud providers already include the discount as you choose a longer commitment on their website, but it's always best to negotiate directly with their sales team. In addition to or instead of a $$-discount you could be offered some useful features/upgrades for free.

If your company has venture capital investors - it could help a lot to mention that, as then the cloud provider knows you are likely to buy more compute down the road and more likely to discount more.

Tier 2 clouds are likely to give better prices than Tier 1. Tier 1 as of this writing is AWS, OCI, Azure and GCP.

For the baseline prices it should be easy to find a few good sites that provide up-to-date public price comparisons across clouds - just search for something like [cloud gpu pricing comparison](https://www.google.com/search?q=cloud+gpu+pricing+comparison). Some good starting points: [vast.ai](https://cloud.vast.ai/create/) and specifically for clusters [gpulist.ai](https://gpulist.ai).

When shopping for a solution please remember that it's not enough to rent the most powerful accelerator. You also need fast [intra-node](../../network#intra-node-networking) and [inter-node](../../network#inter-node-networking) connectivity and sufficiently fast [storage](../../storage) - without which the expensive accelerators will idle waiting for data to arrive and you could be wasting a lot of money and losing time.



## Accelerators in detail

### NVIDIA

Abbreviations:

- CUDA: Compute Unified Device Architecture (proprietary to NVIDIA)

NVIDIA-specific key GPU characteristics:
- CUDA Cores - similar to CPU cores, but unlike CPUs that typically have 10-100 powerful cores, CUDA Cores are weaker and come in thousands, which allows massively parallel general purpose computations. Like CPU cores, CUDA Cores perform a single operation in each clock cycle.
- Tensor Cores - special compute units that are designed specifically to perform fast multiplication and addition operations like matrix multiplication. These perform multiple operations in each clock cycle. They can execute extremely fast computations on low or mixed precision data types with some loss (fp16, bf16, tf32, fp8, etc.). These cores are specifically designed for ML workloads.
- Streaming Multiprocessors (SM) are clusters of CUDA Cores, Tensor Cores and other components.

For example, A100-80GB has:

- 6912 CUDA Cores
- 432 Tensor Cores (Gen 3)
- 108 Streaming Multiprocessors (SM)

H100 has:

- 16896 FP32 CUDA Cores
- 528 Tensor Cores (Gen 4)
- 132 Streaming Multiprocessors (SM)


### AMD

AMD-specific key GPU characteristics:
- Stream Processors - are similar in functionality to CUDA Cores - that is these are the parallel computation units. But they aren't the same, so one can't compare 2 gpus by just comparing the number of CUDA Cores vs the number of Stream Processors.
- Compute Units - are clusters of Stream Processors and other components

For example, AMD MI250 has:
- 13,312 Stream Processors
- 208 Compute Units

[AMD's table comparing its high-end gpus](https://rocm.docs.amd.com/en/latest/reference/gpu-arch-specs.html)

### Intel Gaudi

[Architecture](https://docs.habana.ai/en/latest/Gaudi_Overview/Gaudi_Architecture.html)

- 24x 100 Gigabit Ethernet (RoCEv2) integrated on chip - 21 of which are used for intra-node and 3 for inter-node (so `21*8=168` cards for intra-node (262.5GBps per GPU), and `3*8=24` cards for inter-node (2.4Tbps between nodes))
- 96GB HBM2E memory on board w/2.45 TBps bandwidth per chip, for a total of 768GB per node
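
The networking numbers in the first bullet can be re-derived with a few lines of arithmetic. This is just a sketch of that calculation; the constant names are made up, and it assumes all links run at the full 100Gbps line rate:

```python
LINK_GBPS = 100        # each Ethernet link, in gigabits per second
INTRA_LINKS = 21       # links per chip used inside the node
INTER_LINKS = 3        # links per chip used between nodes
CHIPS_PER_NODE = 8

intra_links_per_node = INTRA_LINKS * CHIPS_PER_NODE                  # 168
inter_links_per_node = INTER_LINKS * CHIPS_PER_NODE                  # 24

# Gbps -> GBps is a division by 8 (bits per byte)
intra_gbytes_per_sec_per_chip = INTRA_LINKS * LINK_GBPS / 8          # 262.5
# Gbps -> Tbps is a division by 1000
inter_tbits_per_sec_per_node = inter_links_per_node * LINK_GBPS / 1000  # 2.4

print(intra_links_per_node, inter_links_per_node)
print(intra_gbytes_per_sec_per_chip, inter_tbits_per_sec_per_node)
```

Note the mixed units: the intra-node figure is in gigabytes per second per chip, while the inter-node figure is in terabits per second per node.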

A server/node is built from 8 GPUs, which can then be expanded with racks of those servers.

There is no official TFLOPS information published (and from talking to an Intel representative they have no intention to publish any). They publish the [following benchmarks](https://www.intel.com/content/www/us/en/developer/platform/gaudi/model-performance.html) but I'm not sure how these can be used to compare this compute to other providers.

Comparison: supposedly Gaudi2 competes with NVIDIA H100





## API

Which software is needed to deploy the high end GPUs?


### NVIDIA

NVIDIA GPUs run on [CUDA](https://developer.nvidia.com/cuda-toolkit)

### AMD

AMD GPUs run on [ROCm](https://www.amd.com/en/products/software/rocm.html) - note that with PyTorch you can use CUDA-based software on ROCm-based GPUs! So it should be trivial to switch to the recent AMD MI250, MI300X, and other emerging ones.

### Intel Gaudi

The API is via [Habana SynapseAI® SDK](https://habana.ai/training-software/) which supports PyTorch and TensorFlow.

Useful integrations:
- [HF Optimum Habana](https://github.com/huggingface/optimum-habana) which also includes [DeepSpeed](https://github.com/deepspeedai/DeepSpeed) integration.







## Apples-to-apples Comparison

It's very difficult to compare specs of different offerings since marketing tricks get deployed pretty much by all competitors so that one can't compare 2 sets of specs and know the actual difference.

- [MLPerf via MLCommons](https://mlcommons.org/en/) publishes various hardware benchmarks that measure training, inference, storage and other tasks' performance. For example, here is the most recent as of this writing [training v3.0](https://mlcommons.org/en/training-normal-30/) and [inference v3.1](https://mlcommons.org/en/inference-datacenter-31/) results.

   Except I have no idea how to make use of it - it's close to impossible to make sense of or control the view. This is a great intention lost in over-engineering and not thinking about how the user will benefit from it, IMHO. For example, I don't care about CV data, I only want to quickly see the LLM rows, but I can't do it. And then the comparisons are still not apples to apples, so I don't know how you can possibly make sense of which hardware is better.



## Power and Cooling

It is most likely that you're renting your accelerator nodes and someone else is responsible for ensuring they function properly, but if you own the accelerators you do need to know how to supply sufficient power and adequate cooling.


### Power

Some high end consumer GPU cards have 2 and sometimes 3 PCI-E 8-Pin power sockets. Make sure you have as many independent 12V PCI-E 8-Pin cables plugged into the card as there are sockets. Do not use the 2 splits at one end of the same cable (also known as pigtail cable). That is if you have 2 sockets on the GPU, you want 2 PCI-E 8-Pin cables going from your PSU to the card and not one that has 2 PCI-E 8-Pin connectors at the end! You won't get the full performance out of your card otherwise.

Each PCI-E 8-Pin power cable needs to be plugged into a 12V rail on the PSU side and can supply up to 150W of power.

Some other cards may use PCI-E 12-Pin connectors, and these can deliver up to 500-600W of power.

Low end cards may use 6-Pin connectors, which supply up to 75W of power.

Additionally you want a high-end PSU that has stable voltage. Some lower quality ones may not give the card the stable voltage it needs to function at its peak.

And of course the PSU needs to have enough unused Watts to power the card.
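
The per-connector limits above lend themselves to a quick budget check. The following is only a sketch: it uses the 6-Pin and 8-Pin limits quoted in this section, leaves out 12-Pin connectors since their limit is given as a range, and all the function names and example wattages are hypothetical:

```python
# Per-cable limits quoted in this section, in Watts
CONNECTOR_WATTS = {"6-pin": 75, "8-pin": 150}


def max_cable_power(connectors: list[str]) -> int:
    """Maximum power (W) deliverable through the given independent cables."""
    return sum(CONNECTOR_WATTS[c] for c in connectors)


def psu_has_headroom(psu_watts: int, other_load_watts: int, card_watts: int) -> bool:
    """True if the PSU's unused capacity covers the card's power draw."""
    return psu_watts - other_load_watts >= card_watts


# A card with two 8-Pin sockets, each fed by its own cable
print(max_cable_power(["8-pin", "8-pin"]))  # 300

# Hypothetical 850W PSU already carrying 400W for the rest of the system
print(psu_has_headroom(850, 400, 300))      # True
print(psu_has_headroom(850, 600, 300))      # False
```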



### Cooling

When a GPU overheats it will start throttling down and will not deliver full performance, and it can even shut down if it gets too hot.

It's hard to tell the exact best temperature to strive for when a GPU is heavily loaded. Probably anything under +80C is good, and lower is better - perhaps 70-75C is an excellent range to be in. The throttling down is likely to start at around 84-90C. Besides throttling performance, a prolonged very high temperature is likely to reduce the lifespan of a GPU.


================================================
FILE: compute/accelerator/amd/debug.md
================================================
# Troubleshooting AMD GPUs

XXX: this is very early - collecting various tools/notes

As most of us are well familiar with NVIDIA tools, I will try to provide the mapping where possible to the familiar tools.

## Tools

### ROCR_VISIBLE_DEVICES

To select a specific gpu (`CUDA_VISIBLE_DEVICES` equivalent):

```
ROCR_VISIBLE_DEVICES=0,1 python my-program.py
```

### rocm-smi

`rocm-smi` (`nvidia-smi` equivalent) shows a condensed state of all the ROCm accelerators.

For example here is an 8xMI300X node:
```
$ rocm-smi
========================================= ROCm System Management Interface =========================================
=================================================== Concise Info ===================================================
Device  [Model : Revision]    Temp        Power     Partitions      SCLK    MCLK    Fan  Perf  PwrCap  VRAM%  GPU%
        Name (20 chars)       (Junction)  (Socket)  (Mem, Compute)
====================================================================================================================
0       [0x74a1 : 0x00]       45.0°C      173.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%
        AMD Instinct MI300X
1       [0x74a1 : 0x00]       41.0°C      179.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%
        AMD Instinct MI300X
2       [0x74a1 : 0x00]       47.0°C      180.0W    NPS1, SPX       131Mhz  900Mhz  0%   auto  750.0W    0%   0%
        AMD Instinct MI300X
3       [0x74a1 : 0x00]       45.0°C      178.0W    NPS1, SPX       131Mhz  900Mhz  0%   auto  750.0W   17%   0%
        AMD Instinct MI300X
4       [0x74a1 : 0x00]       45.0°C      175.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%
        AMD Instinct MI300X
5       [0x74a1 : 0x00]       43.0°C      175.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%
        AMD Instinct MI300X
6       [0x74a1 : 0x00]       45.0°C      175.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%
        AMD Instinct MI300X
7       [0x74a1 : 0x00]       43.0°C      176.0W    NPS1, SPX       132Mhz  900Mhz  0%   auto  750.0W    0%   0%
        AMD Instinct MI300X
====================================================================================================================
=============================================== End of ROCm SMI Log ================================================
```

Oddly it shows no real memory usage - only the percentage, which isn't very practical.

A handy alias to watch updates in real time:
```
alias wr='watch -n 1 rocm-smi'
```

### rocminfo

`rocminfo` (`nvidia-smi -q` equivalent) shows the detailed information about each accelerator.

This one shows both the CPU and the GPU information.

Here is a snippet for cpu0 and gpu0 (note it starts counting the cpus as nodes 0..1, and then GPUs as nodes 2..9):
```
$ rocminfo
ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD EPYC 9534 64-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD EPYC 9534 64-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2450
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            128
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    792303268(0x2f3996a4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    792303268(0x2f3996a4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    792303268(0x2f3996a4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
[...]

  Name:                    gfx942
  Uuid:                    GPU-ababaeeffecddc50
  Marketing Name:          AMD Instinct MI300X
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    2
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
    L2:                      8192(0x2000) KB
  Chip ID:                 29857(0x74a1)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   2100
  BDFID:                   50688
  Internal Node ID:        7
  Compute Unit:            304
  SIMDs per CU:            4
  Shader Engines:          32
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        32(0x20)
  Max Work-item Per CU:    2048(0x800)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 132
  SDMA engine uCode::      19
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    201310208(0xbffc000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 4
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx942:sramecc+:xnack-
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
```


================================================
FILE: compute/accelerator/amd/performance.md
================================================
# AMD GPUs Performance

As I haven't had a chance to do any serious work with AMD GPUs, I'm just sharing links for now.

- [AMD Instinct MI300X system optimization](https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html)
- [AMD Instinct MI300X workload optimization](https://rocm.docs.amd.com/en/latest/how-to/tuning-guides/mi300x/workload.html)

## Profilers

[omniperf](https://github.com/ROCm/omniperf) - Advanced Profiling and Analytics for AMD Hardware - e.g. can plot a roofline performance of your AMD accelerator and many other things.


================================================
FILE: compute/accelerator/benchmarks/README.md
================================================
# Accelerator Benchmarks

## Maximum Achievable Matmul FLOPS Finder

Maximum Achievable Matmul FLOPS (MAMF) Benchmark: [mamf-finder.py](./mamf-finder.py) was derived from the research in the paper [The Case for Co-Designing Model Architectures with Hardware](https://arxiv.org/abs/2401.14489).

For a detailed discussion and the numbers for various accelerators see [Maximum Achievable FLOPS](../#maximum-achievable-flops).

While some accelerator manufacturers publish the theoretical TFLOPS, these usually can't be reached. As a result, when we try to optimize our software we have no realistic performance bar to compare ourselves to. The Model FLOPS Utilization (MFU) metric measures the TFLOPS achieved against the theoretical TFLOPS. Usually, scoring around 50% MFU is considered a win. But this gives us no indication of how far we are from the real achievable throughput.

This benchmark scans various large matmul shapes and reports the highest TFLOPS it registered. Since transformer training and partially inference workloads are dominated by large matmul operations, it's safe to use the best matmul TFLOPS one can measure on each accelerator as a rough estimate of the Maximum Achievable Matmul FLOPS (MAMF). Now, instead of the previously used MFU, one can use Model Achievable Matmul FLOPS Utilization (MAMFU).

Therefore now you can compare the TFLOPS you measured for your training or inference against a realistic number. As you will now be much closer to 100% it'll be much easier to know when to stop optimizing.
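
The two metrics differ only in the denominator. Here is a minimal sketch with purely illustrative numbers (the function names and the TFLOPS values are made up for this example and are not measurements):

```python
def mfu(achieved_tflops: float, theoretical_tflops: float) -> float:
    """Model FLOPS Utilization: achieved vs. the vendor's theoretical peak."""
    return achieved_tflops / theoretical_tflops


def mamfu(achieved_tflops: float, mamf_tflops: float) -> float:
    """Achieved vs. the best matmul TFLOPS measured by mamf-finder.py."""
    return achieved_tflops / mamf_tflops


# hypothetical numbers, for illustration only
theoretical = 1000.0  # vendor-published peak TFLOPS
mamf = 800.0          # best TFLOPS reported by the benchmark
achieved = 500.0      # TFLOPS measured for your own training run

print(f"MFU:   {mfu(achieved, theoretical):.1%}")
print(f"MAMFU: {mamfu(achieved, mamf):.1%}")
```

With these made-up inputs the same run looks like 50% of the theoretical peak but 62.5% of what the hardware was actually measured to deliver.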

Currently supported high end architectures:
- NVIDIA: V100, A100, H100, ...
- AMD: MI250, MI300X, MI325X, ...
- Intel Gaudi2/3

Important notes:
- if you can find a better and more efficient way to detect the best matmul TFLOPS by approaching each new accelerator as a black box, please kindly send a PR with the improvement including the generated log file.
- also if you know that this benchmark should be run under special conditions to show the best results, such as some kernel settings or similar, please submit a PR to add such special instructions. For example, for AMD MI300X I've been told that disabling numa_balancing is supposed to help.
- since a big part of the overhead comes from HBM IO, if you're using a fused kernel with 2 or more matmuls, whose results don't leave the accelerator's registers, the performance will definitely be better than what this benchmark reports.
- It also helps to sample your accelerator's actual clock speed. If your accelerator is running at a slower clock than the one used in the spec, there is no chance you can get the theoretical TFLOPS (see [How To Calculate Theoretical TFLOPS](../README.md#how-to-calculate-theoretical-tflops)).
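
To illustrate the last note: the theoretical peak is proportional to the clock, so a lower sustained clock lowers the ceiling by the same ratio. A rough sketch, assuming hypothetical spec and observed clock values (none of these numbers come from a real datasheet, and the function name is made up):

```python
def clock_adjusted_peak_tflops(spec_tflops: float, spec_clock_mhz: float, actual_clock_mhz: float) -> float:
    """Scale the published peak TFLOPS by the ratio of the observed clock to the spec clock."""
    return spec_tflops * actual_clock_mhz / spec_clock_mhz


# hypothetical accelerator: 1000 TFLOPS published at 2000 MHz, but observed running at 1500 MHz
print(clock_adjusted_peak_tflops(1000.0, 2000.0, 1500.0))
```

In this made-up case the best one could hope for is 750 TFLOPS rather than the published 1000.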

### Architecture specific notes:

Follow the special setup instructions before running the benchmark to achieve the best results:

**MI300X, MI325X, etc.**:

1. Turn numa_balancing off for better performance:
```
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
```
2. Enable PyTorch TunableOp:
```
export PYTORCH_TUNABLEOP_ENABLED=1
```
This will make the first iteration very slow, while it's searching for the best GEMM algorithm in the BLAS libraries for each `matmul` shape it encounters, but subsequent operations are likely to be significantly faster than the baseline. See [Accelerating models on ROCm using PyTorch TunableOp](https://rocm.blogs.amd.com/artificial-intelligence/pytorch-tunableop/README.html) (requires `torch>=2.3`) [doc](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/cuda/tunable/README.md).
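
Both settings are easy to forget, so it can help to check them before a long run. A small sketch of such a pre-flight check follows; the `preflight` helper is hypothetical and not part of `mamf-finder.py`, it only reads the file and the environment variable mentioned above:

```python
import os
from pathlib import Path


def preflight(numa_path="/proc/sys/kernel/numa_balancing", env=None):
    """Return a list of warnings about settings that may hurt the benchmark on AMD GPUs."""
    env = os.environ if env is None else env
    warnings = []

    # numa_balancing: "0" means disabled; the file only exists on Linux
    path = Path(numa_path)
    if path.exists():
        try:
            if path.read_text().strip() != "0":
                warnings.append("numa_balancing is enabled")
        except OSError:
            warnings.append("could not read numa_balancing")

    # TunableOp is enabled by setting the env var to 1
    if env.get("PYTORCH_TUNABLEOP_ENABLED", "0") != "1":
        warnings.append("PYTORCH_TUNABLEOP_ENABLED is not set to 1")

    return warnings


for w in preflight():
    print(f"warning: {w}")
```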

**Intel dGPUs (A770, A750, B580, etc.)**
- Follow Intel Extension for Pytorch [installation steps](https://pytorch-extension.intel.com/installation?platform=gpu)

### Examples of usage

In the ranges below `K` is the reduction dimension so that `(MxK)*(KxN)=(MxN)` and we print the MxNxK shape for the best measured TFLOPS.

Also by default we use 50 warmup and 100 measured iterations for each shape and then the fastest result is picked (not the average); the mean and median are reported alongside it. You can change the number of iterations via the args `--num_warmup_iterations` and `--num_iterations` correspondingly.

You can specify the data type via `--dtype` argument, it has to be one of the valid `torch` dtypes - e.g., `float8_e4m3fn`, `float8_e4m3fnuz` (AMD), `float16`, `bfloat16`, `float32`, etc. If not specified, `bfloat16` is used.

Here we do `torch.mm(MxK,KxN) -> MxN`
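
For reference, here is how per-iteration timings turn into TFLOPS. This is a standalone sketch of the same arithmetic the benchmark uses (`2*m*n*k` floating point operations per matmul, timings in milliseconds); the helper names and the timing values are made up for illustration:

```python
import statistics


def matmul_flos(m: int, n: int, k: int) -> int:
    # each of the m*n output elements needs k multiplications and k additions
    return 2 * m * n * k


def tflops_stats(m: int, n: int, k: int, times_ms: list) -> dict:
    """Convert a list of per-iteration timings (in ms) into mean/median/max TFLOPS."""
    flos = matmul_flos(m, n, k)

    def to_tflops(ms: float) -> float:
        return flos / (ms / 1000 * 10**12)

    return dict(
        mean=to_tflops(statistics.mean(times_ms)),
        median=to_tflops(statistics.median(times_ms)),
        max=to_tflops(min(times_ms)),  # the fastest iteration gives the highest TFLOPS
    )


# hypothetical timings for a 4096x4096x4096 matmul
stats = tflops_stats(4096, 4096, 4096, [2.0, 1.0, 3.0])
print({name: round(value, 1) for name, value in stats.items()})
```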

1. A quick run (under 1min) - should give around 80-90% of the maximum achievable result - good for a quick try out, but not enough to get a high measurement.

```
./mamf-finder.py --m_range 0 20480 256 --n 4096 --k 4096 --output_file=$(date +'%Y-%m-%d-%H:%M:%S').txt
```

2. A more exhaustive search (15-30min) - but you can Ctrl-C it once it has run long enough and get the best result so far:

```
./mamf-finder.py --m_range 0 16384 1024 --n_range 0 16384 1024 --k_range 0 16384 1024  --output_file=$(date +'%Y-%m-%d-%H:%M:%S').txt
```

Feel free to make the steps smaller, going from 1024 to 512 or 256, but that would increase the run time by about 8x or 64x respectively. 1k steps should cover the different shape ranges well and fast.
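
The growth in run time follows directly from the number of shapes scanned, since halving the step doubles the number of values in each of the 3 dimensions. A quick sketch to count the shapes for a given range (the `num_shapes` helper is hypothetical; it mirrors how the script bumps a start of 0 to the step size and treats the stop as exclusive):

```python
def num_shapes(start: int, stop: int, step: int) -> int:
    """Number of MxNxK shapes scanned when all 3 dimensions use the same range."""
    if start == 0:  # can't have a 0 dimension
        start = step
    per_dim = len(range(start, stop, step))
    return per_dim**3


base = num_shapes(0, 16384, 1024)
for step in (1024, 512, 256):
    n = num_shapes(0, 16384, step)
    print(f"step={step:>4}: {n:>7} shapes ({n / base:.1f}x)")
```

The multipliers come out slightly above 8x and 64x because the stop value is exclusive.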

3. A super long exhaustive search (may take many hours/days) - but you can Ctrl-C it once it has run long enough and get the best result so far:

```
./mamf-finder.py --m_range 0 20480 256 --n_range 0 20480 256 --k_range 0 20480 256 --output_file=$(date +'%Y-%m-%d-%H:%M:%S').txt
```

4. If you want to measure a specific shape that is used by your training, use the exact shape, instead of the range, so let's say you wanted to measure 1024x1024x1024 - you'd run:

```
./mamf-finder.py --m 1024 --n 1024 --k 1024 --output_file=$(date +'%Y-%m-%d-%H:%M:%S').txt
```

5. Accelerator specific range seeking suggestions

Different accelerators appear to have different ranges of shapes that lead to the best TFLOPS, so it's difficult to suggest a range that will work well for all of them. Instead, here are some suggestions based on experiments and suggestions from contributors:

- **A100** + **MI300X**

```
./mamf-finder.py --m_range 0 5376 256 --n_range 0 5376 256 --k_range 0 5376 256 --output_file=$(date +'%Y-%m-%d-%H:%M:%S').txt
```

- **H100**

```
./mamf-finder.py --m_range 0 20480 256 --n_range 0 20480 256 --k_range 0 20480 256 --output_file=$(date +'%Y-%m-%d-%H:%M:%S').txt
```

6. fp8 on NVIDIA GPUs example:

```
./mamf-finder.py --m_range 0 20480 1024 --n_range 0 20480 1024 --k_range 0 20480 1024 --dtype float8_e4m3fn --output_file=$(date +'%Y-%m-%d-%H:%M:%S').txt
```

To understand better which shapes give the highest matmul FLOPS for a particular accelerator, see [Vector and matrix size divisibility](../../../training/performance/README.md#vector-and-matrix-size-divisibility).


### Results

The measurements that I have gathered so far can be found at [Maximum Achievable Matmul FLOPS comparison table](../#maximum-achievable-matmul-flops-comparison-table). When I had access to a particular accelerator I ran the benchmarks myself; when I didn't, it was the kind contributors who invested their time to get these numbers. So I'm very grateful to [those contributors](../../../contributors.md).




## How to benchmark accelerators

### CUDA benchmarks

There are a few excellent detailed write ups on how to perform CUDA benchmarks:

1. [How to Accurately Time CUDA Kernels in Pytorch](https://www.speechmatics.com/company/articles-and-news/timing-operations-in-pytorch)
2. [How to Benchmark Code on CUDA Devices?](https://salykova.github.io/sgemm-gpu#2-how-to-benchmark-code-on-cuda-devices) - this one differs from (1) in that it suggests locking both the GPU and memory clocks, whereas (1) only locks the GPU clock.

You can see these instructions applied in [mamf-finder.py](./mamf-finder.py) (other than clock locking).
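
The core pattern behind these write-ups is: warm up first, synchronize around each measured call, and keep the per-iteration timings rather than a single total. Below is a simplified host-side sketch of that pattern. It is not how `mamf-finder.py` does it (the script uses accelerator events), and the `benchmark` helper and its `sync` hook are hypothetical names; on a GPU you would pass something like `torch.cuda.synchronize` as `sync`:

```python
import statistics
import time


def benchmark(fn, sync=lambda: None, num_warmup=50, num_iters=100):
    """Time `fn` and return the per-iteration durations in milliseconds.

    `sync` should block until all previously launched accelerator work is done,
    otherwise only the asynchronous launch overhead gets measured.
    """
    # warmup iterations are run but not measured
    for _ in range(num_warmup):
        fn()

    times_ms = []
    for _ in range(num_iters):
        sync()  # make sure nothing from the previous iteration is still running
        t0 = time.perf_counter()
        fn()
        sync()  # wait for the work launched by fn() to finish
        times_ms.append((time.perf_counter() - t0) * 1000)
    return times_ms


# CPU-only stand-in workload, so that the sketch runs anywhere
times = benchmark(lambda: sum(range(10_000)), num_warmup=5, num_iters=20)
print(f"min={min(times):.3f}ms median={statistics.median(times):.3f}ms")
```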

Here are some excellent related reads:

- Horace's [Strangely, Matrix Multiplications on GPUs Run Faster When Given "Predictable" Data](https://www.thonking.ai/p/strangely-matrix-multiplications?utm_source=substack&publication_id=1781836&post_id=142508107) shows how a benchmark can over-report if the input data isn't normally distributed, and how power impacts performance.


================================================
FILE: compute/accelerator/benchmarks/mamf-finder.py
================================================
#!/usr/bin/env python

"""

This is Maximum Achievable Matmul FLOPS (MAMF) Finder

For a quick run use:

python mamf-finder.py --m_range 0 20480 256 --n 4096 --k 4096 --output_file=$(date +'%Y-%m-%d-%H:%M:%S').txt

But this usually is an insufficient range to get the best results, therefore for multiple examples, discussion and multiple important nuances please refer to
https://github.com/stas00/ml-engineering/tree/master/compute/accelerator/benchmarks#maximum-achievable-matmul-flops-finder

The results are shared here: https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#maximum-achievable-matmul-flops-comparison-table

Credits:
- Parts of this benchmark have been derived from https://github.com/EleutherAI/cookbook/tree/main/benchmarks/sizing (highly recommended!)
- Imtiaz Sajwani: HPU porting
- Xiaoyu Zhang https://github.com/BBuf - flexible dtype support
- Oren Leung https://github.com/OrenLeung - flagging the lack of cache/dest-matrix reset and suggesting a fix - also proposing geomean
- Ivan Fioravanti https://github.com/ivanfioravanti - MPS support
"""

from pathlib import Path

import argparse
import datetime
import numpy as np
import os
import platform
import re
import shlex
import signal
import sys
import time
import torch
from packaging import version
from warnings import warn

# important: when changing how the benchmark measures things bump up its version, so that the old
# reports could be differentiated from the new ones
benchmark_version = 2

has_hpu = False
try:
    import habana_frameworks.torch as ht
    if torch.hpu.is_available():
        has_hpu = True
except ModuleNotFoundError:
    pass

file_dir = os.path.abspath(os.path.dirname(__file__))

def get_torch_dtype(dtype_str):
    """Convert string dtype to torch dtype object."""
    try:
        return getattr(torch, dtype_str)
    except AttributeError:
        raise ValueError(f"Unsupported dtype: {dtype_str}. Must be a valid torch dtype name.")



### Architecture specific helper classes ###

class Arch:
    def __init__(self):
        self.arch = "unknown"

    def __repr__(self):
        return self.arch

class CUDAArch(Arch):
    """ shared with CUDA and ROCm: NVIDIA + AMD """
    def __init__(self):
        if torch.version.hip is not None:
            self.arch = "rocm"
        else:
            self.arch = "cuda"

    @property
    def device(self):
        return torch.device('cuda:0')

    @property
    def name(self):
        return self.arch

    @property
    def device_info(self):
        return torch.cuda.get_device_properties(self.device)

    @property
    def compute_info(self):
        if self.arch == "rocm":
            return f"hip={torch.version.hip}, cuda={torch.version.cuda}"
        else:
            return f"cuda={torch.version.cuda}"

    def event(self, enable_timing=True):
        return torch.cuda.Event(enable_timing)

    def synchronize(self):
        torch.cuda.synchronize()

class HPUArch(Arch):
    """ Intel Gaudi* """
    def __init__(self):
        self.arch = "hpu"

    @property
    def device(self):
        return torch.device('hpu')

    @property
    def name(self):
        return self.arch

    @property
    def device_info(self):
        return torch.hpu.get_device_properties(self.device)

    @property
    def compute_info(self):
        return f"hpu={torch.hpu}"

    def event(self, enable_timing=True):
        return ht.hpu.Event(enable_timing)

    def synchronize(self):
        ht.hpu.synchronize()

class XPUArch(Arch):
    """ Intel dGPUs (like ARC A770) """
    def __init__(self):
        self.arch = "xpu"

    @property
    def device(self):
        return torch.device('xpu')

    @property
    def name(self):
        return self.arch

    @property
    def device_info(self):
        return torch.xpu.get_device_properties(self.device)

    @property
    def compute_info(self):
        return f"xpu={torch.version.xpu}"

    def event(self, enable_timing=True):
        return torch.xpu.Event(enable_timing)

    def synchronize(self):
        torch.xpu.synchronize()

class MPSEvent:
    """Fallback event implementation for Apple's MPS backend."""
    def __init__(self):
        self._timestamp = None

    def record(self):
        torch.mps.synchronize()
        self._timestamp = time.perf_counter()

    def elapsed_time(self, other):
        if self._timestamp is None or other._timestamp is None:
            raise RuntimeError("Attempted to measure elapsed time before events were recorded")
        return (other._timestamp - self._timestamp) * 1000.0

class MPSArch(Arch):
    """ Apple Silicon GPUs via Metal Performance Shaders """
    def __init__(self):
        self.arch = "mps"

    @property
    def device(self):
        return torch.device('mps')

    @property
    def name(self):
        return self.arch

    @property
    def device_info(self):
        return "Apple Metal Performance Shaders (MPS)"

    @property
    def compute_info(self):
        driver_version = None
        if hasattr(torch.backends, "mps") and hasattr(torch.backends.mps, "driver_version"):
            try:
                driver_version = torch.backends.mps.driver_version()
            except TypeError:
                # driver_version may be a property on some torch releases
                driver_version = torch.backends.mps.driver_version
        if driver_version:
            return f"mps={driver_version}"
        return "mps"

    def event(self, enable_timing=True):
        return MPSEvent()

    def synchronize(self):
        torch.mps.synchronize()

def get_accelerator_arch():
    """
    returns: a CUDAArch, HPUArch, XPUArch or MPSArch object
    """
    # cuda / rocm
    if torch.cuda.is_available():
        return CUDAArch()

    # hpu
    if has_hpu:
        return HPUArch()

    # xpu (older torch builds may not have `torch.xpu` at all)
    if hasattr(torch, "xpu") and torch.xpu.is_available():
        return XPUArch()

    if hasattr(torch.backends, "mps") and torch.backends.mps.is_available():
        return MPSArch()

    raise ValueError("Currently only cuda, rocm, hpu, xpu and mps are supported")

arch = get_accelerator_arch()



### Helper classes ###

class Tee(object):
    def __init__(self, filename, verbose):
        Path(filename).resolve().parent.mkdir(parents=True, exist_ok=True)
        self.file = open(filename, "w")
        self.verbose = verbose
        if self.verbose:
            self.stdout = sys.stdout

    def write(self, message):

        if self.verbose:
            self.stdout.write(message)
        # replace `\r` and `033\[K` which are nice in the console, but we don't want those in the log file
        message = re.sub(r"(\r|\033\[K)", "\n", message)
        self.file.write(message)

    def flush(self):
        self.file.flush()
        if self.verbose:
            self.stdout.flush()


def print_benchmark_header(dtype, device, notes="None"):

    device_info = arch.device_info
    compute_info = arch.compute_info

    print(f"""
Benchmark started on {time.strftime('%Y-%m-%d %H:%M:%S', time.localtime())}

** Command line:
{sys.executable} {" ".join(map(shlex.quote, sys.argv))}

** Dtype: {dtype}

** Platform/Device info:
- {" ".join(platform.uname())}
- {device_info}

** Critical software versions:
- torch={torch.__version__}
- {compute_info}

** Critical environment variables:
- PYTORCH_TUNABLEOP_ENABLED={os.environ.get("PYTORCH_TUNABLEOP_ENABLED", "0")}

** Additional notes:
- benchmark version: {benchmark_version}
{notes}

{"-" * 80}

""")

# Benchmark of a basic GEMM
def benchmark_mm(m, n, k, dtype, device, num_iterations, num_warmup_iterations):
    start = arch.event(enable_timing=True)
    end = arch.event(enable_timing=True)

    # this will be used to write to the accelerator between each benchmark iteration to emulate cache reset.
    # On AMD this will really be an l3/LLC cache - later need to figure out how to get the maximum cache
    # size automatically, according to this table 256MB is the highest value so far across all
    # recent accelerators:
    # https://github.com/stas00/ml-engineering/tree/master/compute/accelerator#caches
    l2_cache_size_in_mbs = 256
    l2_cache = torch.empty(int(l2_cache_size_in_mbs * 2**20 / 4), dtype=torch.int, device=device)

    C = torch.empty(m, n, dtype=dtype, device=device).contiguous()
    # this random matrix will be used in the loop to ensure that C gets actually written to, as
    # otherwise the rerun results will be always the same and no power will be drawn to write - would lead
    # to invalid emulation of a real use case
    C_rand = torch.randn(m, n, device=device).to(dtype=dtype).contiguous()

    def time_it(iters=1):
        def decorator(func):
            def func_wrapper(*args, **kwargs):
                start_events = [arch.event(enable_timing=True) for _ in range(iters)]
                end_events = [arch.event(enable_timing=True) for _ in range(iters)]

                for i in range(iters):
                    with torch.no_grad():
                        l2_cache.zero_() # clear accelerator cache
                        C.copy_(C_rand)  # re-randomize the target matrix
                        start_events[i].record()
                        ret = func(*args, **kwargs)
                        end_events[i].record()
                arch.synchronize()
                times = np.array([s.elapsed_time(e) for s, e in zip(start_events, end_events)])
                return times
            return func_wrapper
        return decorator

    total_iterations = num_iterations + num_warmup_iterations

    # fp8 requires special handling depending on the vendor:
    # float8_e4m3fn for nvidia, float8_e4m3fnuz for amd
    fp8_dtypes = [torch.float8_e4m3fn, torch.float8_e4m3fnuz]
    if dtype in fp8_dtypes:
        # torch._scaled_mm is different before pt-2.5
        if version.parse(torch.__version__) < version.parse("2.5"):
            raise ValueError("float8 dtypes require torch>=2.5")
        if dtype == torch.float8_e4m3fn and arch.name == "rocm":
            raise ValueError("ROCm doesn't support float8_e4m3fn, use --dtype float8_e4m3fnuz instead")

        A = torch.randn(m, k, dtype=torch.float32, device=device).contiguous()
        B = torch.randn(n, k, dtype=torch.float32, device=device).contiguous().t()
        scale = torch.tensor([1.0]).to(device)
        A = A.to(dtype)
        B = B.to(dtype)

        # Simplified call for PyTorch 2.5+
        @time_it(total_iterations)
        def time_iterations():
            # must not move `out=C` as `C = ...` as Gaudi needs it this way to work
            torch._scaled_mm(A, B, scale, scale, out=C)

    else:
        A = torch.randn(m, k, dtype=dtype, device=device).contiguous()
        B = torch.randn(n, k, dtype=dtype, device=device).contiguous().t()

        @time_it(total_iterations)
        def time_iterations():
            torch.mm(A, B, out=C)

    times = time_iterations()[num_warmup_iterations:]
    flos = 2 * m * n * k

    mean_elapsed_time = np.mean(times)/1000
    mean_tflops = flos / (mean_elapsed_time * 10**12)

    median_elapsed_time = np.median(times)/1000
    median_tflops = flos / (median_elapsed_time * 10**12)

    min_elapsed_time = np.amin(times)/1000
    max_tflops = flos / (min_elapsed_time * 10**12)

    return mean_tflops, median_tflops, max_tflops

def setup_checks():
    if arch.name == "rocm":
        if int(os.environ.get("PYTORCH_TUNABLEOP_ENABLED", "0")) == 0:
            warn("AMD GPUs usually require `export PYTORCH_TUNABLEOP_ENABLED=1` to measure the best possible compute, but it hasn't been set. Proceeding as is - expect potentially bad/invalid results.")

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    m_group = parser.add_mutually_exclusive_group(required=True)
    m_group.add_argument("--m", nargs="+", type=int, help='The first dimension of the GEMM, enter any number of arguments')
    m_group.add_argument("--m_range", nargs='+', type=int, help="The first dimension of the GEMM, [start,stop,step]")

    n_group = parser.add_mutually_exclusive_group(required=True)
    n_group.add_argument("--n", nargs="+", type=int, help='The last dimension of the GEMM, enter any number of arguments')
    n_group.add_argument("--n_range", nargs='+', type=int, help="The last dimension of the GEMM, [start,stop,step]")

    k_group = parser.add_mutually_exclusive_group(required=True)
    k_group.add_argument("--k", nargs="+", type=int, help='The shared (reduction) dimension of the GEMM, enter any number of arguments')
    k_group.add_argument("--k_range", nargs='+', type=int, help="The shared (reduction) dimension of the GEMM, [start,stop,step]")

    parser.add_argument("--num_iterations", type=int, default=100, help='The number of iterations used to benchmark each GEMM')
    parser.add_argument("--num_warmup_iterations", type=int, default=50, help='The number of warmup iterations')
    parser.add_argument("--cuda_device", type=int, default=0, help="The cuda device to run the benchmark on")
    parser.add_argument("--output_file", type=str, default=f"{file_dir}/results/mm.out")
    parser.add_argument("--notes", type=str, default="", help="benchmark-specific notes to add to the output_file's header")
    parser.add_argument("--verbose", default=True, action=argparse.BooleanOptionalAction, help='log to stdout besides output_file?')
    parser.add_argument("--dtype", type=str, default="bfloat16",
                        help="Data type to use for the benchmark (e.g., float32, float16, bfloat16, float8_e4m3fn, float8_e4m3fnuz)")
    args = parser.parse_args()

    m = args.m
    n = args.n
    k = args.k

    dtype = get_torch_dtype(args.dtype)
    device = arch.device

    setup_checks()

    range_info = (
        f"m={args.m_range if m is None else args.m} | "
        f"n={args.n_range if n is None else args.n} | "
        f"k={args.k_range if k is None else args.k}"
    )

    if m is None:
        start, stop, step = args.m_range
        if start == 0: # can't have a 0 dimension
            start = step
        m = np.arange(start, stop, step)
    if n is None:
        start, stop, step = args.n_range
        if start == 0: # can't have a 0 dimension
            start = step
        n = np.arange(start, stop, step)
    if k is None:
        start, stop, step = args.k_range
        if start == 0: # can't have a 0 dimension
            start = step
        k = np.arange(start, stop, step)

    sys.stdout = Tee(args.output_file, args.verbose)
    print_benchmark_header(dtype, device, args.notes)

    # this is useful for when one wants to interrupt the run - and still report the best outcome so far
    def sigkill_handler(signum, frame):
        finish()
        sys.exit(1)

    signal.signal(signal.SIGINT, sigkill_handler)

    best_tflops = dict(max=0, median=0, mean=0)
    best_config = dict(max="", median="", mean="")
    num_shapes = 0
    all_mean_tflops = []
    start_time = time.time()

    def finish():

        all_tried_shapes_geometric_mean_tflops  = np.exp(np.log(all_mean_tflops).mean())
        all_tried_shapes_arithmetic_mean_tflops = np.mean(all_mean_tflops)

        time_delta = time.time() - start_time
        time_str = str(datetime.timedelta(seconds=time_delta)).split(".")[0]
        print("", end="\033[K")
        print(f"""
Tried {num_shapes} shapes => the best outcomes were:
mean:   {best_tflops["mean"]:.1f} TFLOPS @ {best_config["mean"]}
median: {best_tflops["median"]:.1f} TFLOPS @ {best_config["median"]}
max:    {best_tflops["max"]:.1f} TFLOPS @ {best_config["max"]}

Across {num_shapes} shapes in range: {range_info} in this run:
arithmetic mean: {all_tried_shapes_arithmetic_mean_tflops:.1f} TFLOPS
geometric mean:  {all_tried_shapes_geometric_mean_tflops:.1f} TFLOPS
""")
        print(f"Legend: TFLOPS = 10**12 FLOPS")
        print(f"Elapsed time: {time_str}")

    # XXX: the transpose version seemed to work better for MI300X

    # always start with additional warmup iterations to give fair results, otherwise, based on
    # rerunning this benchmark many times, a cold accelerator gives a higher score on, say, a single
    # shape than the same shape run after a dozen other shapes
    accelerator_warmup_seconds = 30
    end_time = time.monotonic() + accelerator_warmup_seconds
    print(f"Warming up the accelerator for {accelerator_warmup_seconds} secs ... ", end="", flush=True)
    while time.monotonic() < end_time:
        _ = benchmark_mm(m[0], n[0], k[0], dtype, device, args.num_iterations, args.num_warmup_iterations)
    print("accelerator warmup finished")

    # loop through all sizes to benchmark
    for M in m:
        for N in n:
            for K in k:
                num_shapes += 1
                mean_tflops, median_tflops, max_tflops = benchmark_mm(M, N, K, dtype, device, args.num_iterations, args.num_warmup_iterations)
                all_mean_tflops.append(mean_tflops)

                cur_config = f"{M}x{N}x{K}"
                if median_tflops > best_tflops["median"]:
                    best_tflops["median"] = median_tflops
                    best_config["median"] = f"{cur_config} (MxNxK)"
                if mean_tflops > best_tflops["mean"]:
                    best_tflops["mean"] = mean_tflops
                    best_config["mean"] = f"{cur_config} (MxNxK)"
                if max_tflops > best_tflops["max"]:
                    best_tflops["max"] = max_tflops
                    best_config["max"] = f"{cur_config} (MxNxK)"

                print(f"{num_shapes:>6} | {mean_tflops:6.1f}(mean) {median_tflops:6.1f}(median) {max_tflops:6.1f}(max) @ {cur_config:<20} | best: {best_tflops['mean']:6.1f}(mean) {best_tflops['median']:6.1f}(median) {best_tflops['max']:6.1f}(max) TFLOPS", end="\r")
    finish()


================================================
FILE: compute/accelerator/nvidia/debug.md
================================================
# Troubleshooting NVIDIA GPUs

## Glossary

- DBE: Double Bit ECC Error
- DCGM: (NVIDIA) Data Center GPU Manager
- ECC: Error-Correcting Code
- FB: Frame Buffer
- SBE: Single Bit ECC Error
- SDC: Silent Data Corruption

## Xid Errors

No hardware is perfect. Due to manufacturing problems or wear and tear (especially from exposure to high heat), GPUs are likely to encounter various hardware issues. Many of these issues get corrected automatically, without you needing to understand what's going on. If the application continues running, there is usually nothing to worry about. If the application crashes due to a hardware issue, it's important to understand why and how to act on it.

A normal user who uses a handful of GPUs is unlikely to ever need to understand GPU-related hardware issues, but if you come anywhere close to massive ML training, where you are likely to use hundreds to thousands of GPUs, you will certainly want to understand the different hardware issues.

In your system logs you are likely to occasionally see Xid Errors like:

```
NVRM: Xid (PCI:0000:10:1c): 63, pid=1896, Row Remapper: New row marked for remapping, reset gpu to activate.
```

To get those logs, one of the following should work:
```
sudo grep Xid /var/log/syslog
sudo dmesg -T | grep Xid
```
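
When there are many nodes to watch, it can help to summarize these log lines instead of reading them by eye. Here is a minimal sketch that counts Xid events per PCI address from log text you have already collected (e.g. the output of the commands above). The regex is an assumption based only on the single log line shown earlier, and `count_xids` is a hypothetical helper name, so adjust both to what your driver actually emits:

```python
import re
from collections import Counter

# Assumed line shape, taken from the example above:
# NVRM: Xid (PCI:0000:10:1c): 63, pid=1896, Row Remapper: ...
XID_RE = re.compile(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+),")


def count_xids(log_text):
    """Return a Counter keyed by (pci_address, xid_code)."""
    counts = Counter()
    for line in log_text.splitlines():
        m = XID_RE.search(line)
        if m:
            counts[(m.group(1), int(m.group(2)))] += 1
    return counts


sample = """\
NVRM: Xid (PCI:0000:10:1c): 63, pid=1896, Row Remapper: New row marked for remapping, reset gpu to activate.
some unrelated kernel message
"""

for (pci, xid), n in sorted(count_xids(sample).items()):
    print(f"{pci}: Xid {xid} seen {n} time(s)")
```

Feeding it the saved output of `sudo dmesg -T` gives a quick per-GPU tally, which makes repeating offenders easier to spot.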

As long as the training doesn't crash, these errors typically indicate issues that get corrected automatically by the hardware.

The full list of Xid Errors and their interpretation can be found [here](https://docs.nvidia.com/deploy/xid-errors/index.html).

You can run `nvidia-smi -q` and see if there are any error counts reported. For example, in this case of Xid 63, you will see something like:

```
Timestamp                                 : Wed Jun  7 19:32:16 2023
Driver Version                            : 510.73.08
CUDA Version                              : 11.6

Attached GPUs                             : 8
GPU 00000000:10:1C.0
    Product Name                          : NVIDIA A100-SXM4-80GB
    [...]
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 177
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 0
            DRAM Correctable              : 177
            DRAM Uncorrectable            : 0
    Retired Pages
        Single Bit ECC                    : N/A
        Double Bit ECC                    : N/A
        Pending Page Blacklist            : N/A
    Remapped Rows
        Correctable Error                 : 1
        Uncorrectable Error               : 0
        Pending                           : Yes
        Remapping Failure Occurred        : No
        Bank Remap Availability Histogram
            Max                           : 639 bank(s)
            High                          : 1 bank(s)
            Partial                       : 0 bank(s)
            Low                           : 0 bank(s)
            None                          : 0 bank(s)
[...]
```

Here we can see that Xid 63 corresponds to:

```
ECC page retirement or row remapping recording event
```

which may have 3 causes: HW Error / Driver Error / FrameBuffer (FB) Corruption.

This error means that one of the memory rows is malfunctioning and that upon a reboot or a GPU reset one of the 640 spare memory rows (in A100) will be used to replace the bad row. Therefore we see in the report above that only 639 banks remain (out of 640).

The Volatile section of the `ECC Errors` report above refers to the errors recorded since the last reboot/GPU reset. The Aggregate section records the same errors since the GPU was first used.

Now, there are 2 types of errors - Correctable and Uncorrectable. The correctable one is a Single Bit ECC Error (SBE), where despite the memory being faulty the driver can still recover the correct value. The uncorrectable one is where more than one bit is faulty, and it's called a Double Bit ECC Error (DBE). Typically, the driver will retire whole memory pages if 1 DBE or 2 SBE errors occur at the same memory address. For full information see [this document](https://docs.nvidia.com/deploy/dynamic-page-retirement/index.html).

A correctable error will not impact the application; an uncorrectable one will crash the application. The memory page containing the uncorrectable ECC error will be blacklisted and not accessible until the GPU is reset.

If there are pages scheduled to be retired you will see something like this in the output of `nvidia-smi -q`:

```
    Retired pages
        Single Bit ECC             : 2
        Double Bit ECC             : 0
        Pending Page Blacklist    : Yes
```

Each retired page decreases the total memory available to applications. But the maximum number of pages that can be retired amounts to only 4MB in total, so it doesn't reduce the total available GPU memory by much.

To dive even deeper into GPU debugging, please refer to [this document](https://docs.nvidia.com/deploy/gpu-debug-guidelines/index.html) - it includes a useful triage chart which helps to determine when to RMA GPUs. This document has additional information about Xid 63-like errors.

For example it suggests:

> If associated with XID 94, the application that encountered the error needs to be restarted. All other applications on the system can keep running as is until there is a convenient time to reboot for row remapping to activate.
> See below for guidelines on when to RMA GPUs based on row remapping failures.

If after a reboot the same condition occurs for the same memory address, it means that memory remapping has failed and Xid 64 will be emitted again. If this continues, it means you have a hardware issue that can't be auto-corrected and the GPU needs to be RMA'ed.

At other times you may get Xid 63 or 64 and the application will crash. This will usually generate additional Xid errors, and most of the time it means that the error was uncorrectable (i.e. it was a DBE kind of error, in which case you will also see Xid 48).

As mentioned earlier, to reset a GPU you can either simply reboot the machine, or run:

```
nvidia-smi -r -i gpu_id
```

where `gpu_id` is the sequential number of the gpu you want to reset, e.g. `0` for the first GPU. Without `-i` all GPUs will be reset.

### uncorrectable ECC error encountered

If you get an error:
```
CUDA error: uncorrectable ECC error encountered
```
then, as in the previous section, checking the output of `nvidia-smi -q`, this time for the `ECC Errors` entries, will tell you which GPU is the problematic one. But if you need a quick check in order to recycle a node that has at least one GPU with this issue, you can just do this:

```
$ nvidia-smi -q | grep -i correctable | grep -v 0
            SRAM Uncorrectable            : 1
            SRAM Uncorrectable            : 5
```
On a good node, this should return nothing, as all counters should be 0. But in the example above we had one broken GPU - there were two entries because the full record was:

```
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 1
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 5
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
```
The first entry is for `Volatile` (errors counted since the last GPU driver reload) and the second is for `Aggregate` (the total error count over the whole lifetime of the GPU). In this example the Volatile counter for SRAM Uncorrectable errors is 1 and the lifetime counter is 5, that is, this is not the first time this GPU has run into this problem.

This would typically correspond to an Xid 94 error (see [Xid Errors](#xid-errors)), most likely without Xid 48.

To overcome this issue, as in the previous section, reset the problematic GPU:
```
nvidia-smi -r -i gpu_id
```
Rebooting the machine will have the same effect.

Now when it comes to Aggregate SRAM Uncorrectable errors, if you have more than 4, that's usually a reason to RMA that GPU.
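
One caveat with the quick `grep -v 0` check above is that it drops any line containing the character `0`, so a counter of, say, `10` would be silently hidden. Here is a minimal Python sketch that parses the counters as numbers instead and also applies the "more than 4 Aggregate SRAM Uncorrectable errors" rule of thumb. The field names and layout are assumed from the report excerpts in this chapter and may differ between driver versions; `nonzero_ecc_counters` and `rma_candidates` are hypothetical helper names, and the sample stitches a GPU header from one excerpt onto the ECC record from another:

```python
import re

# A GPU record in `nvidia-smi -q` starts with an unindented "GPU <bus id>" line
GPU_RE = re.compile(r"^GPU ([0-9A-Fa-f:.]+)\s*$")
ECC_RE = re.compile(
    r"^\s*((?:SRAM|DRAM) (?:Correctable|Uncorrectable))\s*:\s*(\d+)\s*$"
)


def nonzero_ecc_counters(report):
    """Return (gpu_bus_id, section, counter_name, value) for every counter > 0."""
    found = []
    gpu = section = None
    for line in report.splitlines():
        m = GPU_RE.match(line)
        if m:
            gpu = m.group(1)
            continue
        if line.strip() in ("Volatile", "Aggregate"):
            section = line.strip()
            continue
        m = ECC_RE.match(line)
        if m and int(m.group(2)) > 0:
            found.append((gpu, section, m.group(1), int(m.group(2))))
    return found


def rma_candidates(report, threshold=4):
    """GPUs whose Aggregate SRAM Uncorrectable count exceeds the threshold."""
    return sorted(
        {
            gpu
            for gpu, section, name, value in nonzero_ecc_counters(report)
            if section == "Aggregate"
            and name == "SRAM Uncorrectable"
            and value > threshold
        }
    )


sample = """\
GPU 00000000:10:1C.0
    ECC Errors
        Volatile
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 1
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
        Aggregate
            SRAM Correctable              : 0
            SRAM Uncorrectable            : 5
            DRAM Correctable              : 0
            DRAM Uncorrectable            : 0
"""

for entry in nonzero_ecc_counters(sample):
    print(entry)
print("RMA candidates:", rma_candidates(sample))
```

On a real node you would pass in the captured output of `nvidia-smi -q`, e.g. via `subprocess.run(["nvidia-smi", "-q"], capture_output=True, text=True).stdout`.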



## Running diagnostics

If you suspect one or more NVIDIA GPUs are broken on a given node, `dcgmi` is a great tool to quickly find any bad GPUs.

NVIDIA® Data Center GPU Manager (DCGM) is documented [here](https://docs.nvidia.com/datacenter/dcgm/latest/user-guide/index.html) and can be downloaded from [here](https://github.com/NVIDIA/DCGM#quickstart).

Here is an example slurm script that will run very in-depth diagnostics (`-r 3`), which will take about 10 minutes to complete on an 8-GPU node:

```
$ cat dcgmi-1n.slurm
#!/bin/bash
#SBATCH --job-name=dcgmi-1n
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=96
#SBATCH --gres=gpu:8
#SBATCH --exclusive
#SBATCH --output=%x-%j.out

set -x -e
echo "START TIME: $(date)"
srun --output=%x-%j-%N.out dcgmi diag -r 3
echo "END TIME: $(date)"
```

Now to run it on specific nodes of choice:
```
sbatch --nodelist=node-115 dcgmi-1n.slurm
sbatch --nodelist=node-151 dcgmi-1n.slurm
sbatch --nodelist=node-170 dcgmi-1n.slurm
```
Edit the `--nodelist` argument to point to the name of the node you want to test.

If the node is drained or downed and you can't launch a slurm job using this node, just `ssh` into the node and run the command directly on the node:
```
dcgmi diag -r 3
```
If the diagnostics didn't find any issue, but the application still fails to work, re-run the diagnostics with level 4, which will now take more than 1 hour to complete:
```
dcgmi diag -r 4
```

footnote: apparently silent data corruptions (SDC) can only be detected with `dcgmi diag -r 4` and even then some might be missed. This problem happens occasionally and you may not even be aware that your GPU is messing up the `matmul` at times. I'm pretty sure we had this happen to us, as we were getting weird glitches during training and I spent many days with the NVIDIA team diagnosing the problem, but we failed to do so - eventually the problem disappeared probably because the bad GPU(s) got replaced due to reported failures.

For example, if you run into a repeating Xid 64 error it's likely that the diagnostics report will include:

```
+---------------------------+------------------------------------------------+
| Diagnostic                | Result                                         |
+===========================+================================================+
|-----  Deployment  --------+------------------------------------------------|
| Error                     | GPU 3 has uncorrectable memory errors and row  |
|                           |  remappings are pending                        |
```

so you now know to RMA that problematic GPU, if remapping fails.

But, actually, I found that most of the time `-r 2` already detects faulty GPUs. And it takes just a few minutes to complete. Here is an example of the `-r 2` output on a faulty node:

```
| GPU Memory                | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
| Warning                   | GPU 0 Thermal violations totaling 13.3 second  |
|                           | s started at 9.7 seconds into the test for GP  |
|                           | U 0 Verify that the cooling on this machine i  |
|                           | s functional, including external, thermal mat  |
|                           | erial interface, fans, and any other componen  |
|                           | ts.
```
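
If you collect these reports from many nodes, a small script can pull out the failing GPU ids for you. This is only a sketch: the `Fail - GPU: 0` pattern is assumed from the single report excerpt above, the real `dcgmi` output may vary between versions, and `failed_gpus` is a hypothetical helper name:

```python
import re

# Assumed shape, from the excerpt above: "Fail - GPU: 0" or "Fail - GPUs: 0, 3"
FAIL_RE = re.compile(r"Fail - GPUs?:\s*([0-9, ]+)")


def failed_gpus(diag_output):
    """Return the sorted list of GPU ids that appear on 'Fail' lines."""
    ids = set()
    for m in FAIL_RE.finditer(diag_output):
        ids.update(int(x) for x in m.group(1).replace(" ", "").split(",") if x)
    return sorted(ids)


sample = """\
| GPU Memory                | Pass - GPUs: 1, 2, 3, 4, 5, 6, 7               |
|                           | Fail - GPU: 0                                  |
"""

print(failed_gpus(sample))
```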

The `dcgmi` tool contains various other levels of diagnostics, some of which complete in a matter of a few minutes and can be run as a quick diagnostic in the epilogue of SLURM jobs to ensure that the node is ready to work for the next SLURM job, rather than discovering that after the user started their job and it crashed.

When filing an RMA report you will be asked to run the `nvidia-bug-report` script, the output of which you will need to submit with the RMA request.

I usually save the log as well for posterity using one of:
```
dcgmi diag -r 2 | tee -a dcgmi-r2-`hostname`.txt
dcgmi diag -r 3 | tee -a dcgmi-r3-`hostname`.txt
dcgmi diag -r 4 | tee -a dcgmi-r4-`hostname`.txt
```

## How to get the VBIOS info

The GPU VBIOS version might be important when researching issues. Here is how to query it, together with the GPU name and bus id:

```
$ nvidia-smi --query-gpu=gpu_name,gpu_bus_id,vbios_version --format=csv
name, pci.bus_id, vbios_version
NVIDIA H100 80GB HBM3, 00000000:04:00.0, 96.00.89.00.01
[...]
NVIDIA H100 80GB HBM3, 00000000:8B:00.0, 96.00.89.00.01
```

Hint: to query for dozens of other things, run:
```
nvidia-smi --help-query-gpu
```

## How to check if your GPU's PCIe generation is supported

Check the PCIe bandwidth reports from the system's boot messages:

```
$ sudo dmesg | grep -i 'limited by'
[   10.735323] pci 0000:04:00.0: 252.048 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x16 link at 0000:01:00.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
[...]
[   13.301989] pci 0000:8b:00.0: 252.048 Gb/s available PCIe bandwidth, limited by 16.0 GT/s PCIe x16 link at 0000:87:00.0 (capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)
```

In this example, as PCIe 5 spec is 504Gbps, you can see that on this node only half of the possible bandwidth is usable, because the PCIe switch is gen4. For PCIe specs see [this](../../../network#pcie).
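
To make the "only half of the possible bandwidth" observation explicit, here is a minimal sketch that extracts the available and the capable bandwidth from such boot messages and computes their ratio. The regex is assumed from the two example lines above only, and `limited_links` is a hypothetical helper name:

```python
import re

# Assumed message shape, taken from the dmesg example above
LINK_RE = re.compile(
    r"pci (\S+): ([\d.]+) Gb/s available PCIe bandwidth, "
    r"limited by ([\d.]+) GT/s .*?"
    r"\(capable of ([\d.]+) Gb/s with ([\d.]+) GT/s"
)


def limited_links(dmesg_text):
    """Return one dict per 'limited by' line with the usable bandwidth fraction."""
    links = []
    for line in dmesg_text.splitlines():
        m = LINK_RE.search(line)
        if not m:
            continue
        available, capable = float(m.group(2)), float(m.group(4))
        links.append(
            {
                "device": m.group(1),
                "available_gbps": available,
                "capable_gbps": capable,
                "fraction": available / capable,
            }
        )
    return links


sample = (
    "[   10.735323] pci 0000:04:00.0: 252.048 Gb/s available PCIe bandwidth, "
    "limited by 16.0 GT/s PCIe x16 link at 0000:01:00.0 "
    "(capable of 504.112 Gb/s with 32.0 GT/s PCIe x16 link)"
)

for link in limited_links(sample):
    print(f"{link['device']}: {link['fraction']:.0%} of the capable bandwidth")
```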

Since most likely you have [NVLink](../../../network#nvlink) connecting the GPUs to each other, this shouldn't matter for GPU to GPU comms, but it'd slow down any data movement between the GPU and the host, as the data speed is limited by the speed of the slowest link.



## How to check error counters of NVLink links

If you're concerned that your NVLink is malfunctioning, you can check its error counters:
```
$ nvidia-smi nvlink -e
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-abcdefab-cdef-abdc-abcd-abababababab)
         Link 0: Replay Errors: 0
         Link 0: Recovery Errors: 0
         Link 0: CRC Errors: 0

         Link 1: Replay Errors: 0
         Link 1: Recovery Errors: 0
         Link 1: CRC Errors: 0

         [...]

         Link 17: Replay Errors: 0
         Link 17: Recovery Errors: 0
         Link 17: CRC Errors: 0
```

Another useful command is:
```
$ nvidia-smi nvlink --status
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-abcdefab-cdef-abdc-abcd-abababababab)
         Link 0: 26.562 GB/s
         [...]
         Link 17: 26.562 GB/s
```
This one tells you the current speed of each link.

Run `nvidia-smi nvlink -h` to discover more features (reporting, resetting counters, etc.).
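
With 18 links per GPU and 8 GPUs per node the error report gets long, so here is a minimal sketch that keeps only the non-zero counters. The line format is assumed from the `nvidia-smi nvlink -e` output shown above, `nvlink_errors` is a hypothetical helper name, and the non-zero value in the sample is made up for illustration:

```python
import re

GPU_RE = re.compile(r"^GPU (\d+):")
ERR_RE = re.compile(r"^\s*Link (\d+): (.+?) Errors: (\d+)\s*$")


def nvlink_errors(output):
    """Return (gpu, link, error_kind, count) for every non-zero error counter."""
    bad = []
    gpu = None
    for line in output.splitlines():
        m = GPU_RE.match(line)
        if m:
            gpu = int(m.group(1))
            continue
        m = ERR_RE.match(line)
        if m and int(m.group(3)) > 0:
            bad.append((gpu, int(m.group(1)), m.group(2), int(m.group(3))))
    return bad


# the "3" below is a made-up value, the rest follows the example output above
sample = """\
GPU 0: NVIDIA H100 80GB HBM3 (UUID: GPU-abcdefab-cdef-abdc-abcd-abababababab)
         Link 0: Replay Errors: 0
         Link 0: Recovery Errors: 0
         Link 0: CRC Errors: 0

         Link 1: Replay Errors: 0
         Link 1: Recovery Errors: 0
         Link 1: CRC Errors: 3
"""

for gpu, link, kind, count in nvlink_errors(sample):
    print(f"GPU {gpu} link {link}: {count} {kind} error(s)")
```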


## How to detect if a node is missing GPUs

If you got a new VM, there are odd cases where it has fewer GPUs than expected. Here is how you can quickly test that you have got 8 of them:

```
cat << 'EOT' > test-gpu-count.sh
#!/bin/bash

set -e

# test the node has 8 gpus - exits with 0 on a good node and with 1 otherwise
if [ "$(nvidia-smi -q | grep -c UUID)" -ne 8 ]; then
    echo "broken node: less than 8 gpus"
    exit 1
fi
EOT
```
and then:

```
bash test-gpu-count.sh
```


## How to detect if you get the same broken node again and again

This is mostly relevant to cloud users who rent GPU nodes.

So you launched a new virtual machine and discovered it has one or more broken NVIDIA GPUs. You discarded it and launched a new one, and the GPUs are broken again.

Chances are that you're getting the same node with the same broken GPUs. Here is how you can know that.

Before discarding the current node, run and log:

```
$ nvidia-smi -q | grep UUID
    GPU UUID                              : GPU-2b416d09-4537-ecc1-54fd-c6c83a764be9
    GPU UUID                              : GPU-0309d0d1-8620-43a3-83d2-95074e75ec9e
    GPU UUID                              : GPU-4fa60d47-b408-6119-cf63-a1f12c6f7673
    GPU UUID                              : GPU-fc069a82-26d4-4b9b-d826-018bc040c5a2
    GPU UUID                              : GPU-187e8e75-34d1-f8c7-1708-4feb35482ae0
    GPU UUID                              : GPU-43bfd251-aad8-6e5e-ee31-308e4292bef3
    GPU UUID                              : GPU-213fa750-652a-6cf6-5295-26b38cb139fb
    GPU UUID                              : GPU-52c408aa-3982-baa3-f83d-27d047dd7653
```

These UUIDs are unique to each GPU.

After you re-create your VM, run this command again - if the UUIDs are the same, you know you have the same broken GPUs.

To automate this process, so that you always have this data (it'd be too late if you have already rebooted the VM), add this somewhere in your startup process:

```
nvidia-smi -q | grep UUID > nvidia-uuids.$(hostname).$(date '+%Y-%m-%d-%H:%M').txt
```

You'd want to save the log file on some persistent filesystem for it to survive a reboot. If you do not have one, write it locally and immediately copy it to the cloud. That way it'll always be there when you need it.

Sometimes just rebooting the node will get new hardware. In some situations you get new hardware on almost every reboot, in other situations this doesn't happen. And this behavior may change from one provider to another.

If you keep on getting the same broken node, one trick to overcome this is to allocate a new VM while keeping the broken VM running, and to discard the broken one only once the new VM is up. That way you will surely get new GPUs - except there is no guarantee they won't be broken as well. If the use case fits, consider getting a static cluster where it's much easier to keep the good hardware.

This method is extra-crucial for when GPUs don't fail right away but after some use so it is non-trivial to see that there is a problem. Even if you reported this node to the cloud provider the technician may not notice the problem right away and put the bad node back into circulation. So if you're not using a static cluster and tend to get random VMs on demand you may want to keep a log of bad UUIDs and know you have got a lemon immediately and not 10 hours into the node's use.
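
Here is a minimal sketch of such a check: it extracts the UUIDs from the current `nvidia-smi -q | grep UUID` output and intersects them with a log of UUIDs you have previously recorded as bad. The function names are hypothetical, and the sample reuses two UUIDs from the example above purely for illustration:

```python
import re

# Matches lines like "    GPU UUID    : GPU-2b416d09-4537-ecc1-54fd-c6c83a764be9"
UUID_RE = re.compile(r"GPU UUID\s*:\s*(GPU-[0-9a-f-]+)")


def extract_uuids(text):
    """Return the set of GPU UUIDs found in `nvidia-smi -q | grep UUID` output."""
    return set(UUID_RE.findall(text))


def known_bad_gpus(current_report, bad_uuid_log):
    """Return the UUIDs of the current node that already appear in the bad log."""
    return sorted(extract_uuids(current_report) & extract_uuids(bad_uuid_log))


# log saved from a previously discarded broken node
bad_log = """\
    GPU UUID                              : GPU-2b416d09-4537-ecc1-54fd-c6c83a764be9
"""

# what the freshly allocated node reports
current = """\
    GPU UUID                              : GPU-2b416d09-4537-ecc1-54fd-c6c83a764be9
    GPU UUID                              : GPU-0309d0d1-8620-43a3-83d2-95074e75ec9e
"""

bad = known_bad_gpus(current, bad_log)
if bad:
    print("this node contains previously seen bad GPUs:", ", ".join(bad))
```

Running something like this at startup, against the concatenation of all the saved logs of bad nodes, tells you right away whether you were handed a known lemon.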

Cloud providers usually have a mechanism of reporting bad nodes. Therefore other than discarding a bad node, it'd help yourself and other users to report bad nodes. Since most of the time users just discard the bad nodes, the next user is going to get them. I have seen users getting a very high percentage of bad nodes in some situations.


## How to get the real GPU utilization metrics

As explained [here](https://arthurchiao.art/blog/understanding-gpu-performance/) the `GPU-Util` column in the `nvidia-smi` output isn't really telling you the GPU Utilization. What it's telling you is the percentage of time during which one or more kernels were executing on the GPU. It's not telling you whether a single SM is being used or all of them. So even if you run a tiny `matmul` all the time, you may get a very high gpu util, while most of the GPU isn't doing anything.

footnote: I have seen the GPU util column showing 100% on all GPUs when one GPU would stop responding and then the whole machinery was blocked waiting for that GPU to respond. Which is how I discovered that it couldn't be showing the real GPU utilization in the first place.

What you want to measure instead is the GPU's utilization of its available capacity, otherwise known as "saturation". Alas, this information isn't provided by `nvidia-smi`. In order to get this information you need to install [dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter), which in turn currently requires a recent golang, DCGM (`datacenter-gpu-manager`) and root access.

Please note that this tool works only with high-end data center NVIDIA GPUs, so if you have a consumer-level GPU it won't work.

After installing the prerequisites I built the tool:
```
git clone https://github.com/NVIDIA/dcgm-exporter.git
cd dcgm-exporter
make binary
```

And then I was able to get the "real" utilization metrics described in the article with this `dcgm-exporter` config file:

```
$ cat << EOT > dcp-metrics-custom.csv
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM.
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, gauge, Ratio of cycles the tensor (HMMA) pipe is active.
DCGM_FI_PROF_PIPE_FP16_ACTIVE,   gauge, Ratio of cycles the fp16 pipes are active.
DCGM_FI_PROF_PIPE_FP32_ACTIVE,   gauge, Ratio of cycles the fp32 pipes are active.
EOT
```

Then I launched the daemon (root is required):
```
$ sudo cmd/dcgm-exporter/dcgm-exporter -c 500 -f dcp-metrics-custom.csv
[...]
INFO[0000] Starting webserver
INFO[0000] Listening on                                  address="[::]:9400"
```

`-c 500` refreshes every 0.5sec

and now I was able to poll it via:
```
watch -n 0.5 "curl http://localhost:9400/metrics"
```
by running it in one console, and launching a GPU workload in another console. The last column of the output is the utilization of these metrics (where `1.0 == 100%`).

`etc/dcp-metrics-included.csv` from the repo contains all the available metrics, so you can add more metrics.
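
If you'd rather post-process the polled values than watch them scroll by, here is a minimal sketch that parses the text served on `/metrics` into a dictionary. It assumes the usual Prometheus text exposition format (`name{labels} value`, with `#` comment lines) and a `gpu="..."` label on each sample; the label set in the sample is made up, so check what your `dcgm-exporter` build actually emits. `parse_metrics` is a hypothetical helper name:

```python
import re

LINE_RE = re.compile(r"^(DCGM_FI_\w+)\{(.*)\}\s+(\S+)\s*$")
GPU_LABEL_RE = re.compile(r'gpu="([^"]+)"')  # assumed label name


def parse_metrics(text):
    """Return {(gpu, metric_name): value} from Prometheus-style metrics text."""
    values = {}
    for line in text.splitlines():
        if line.startswith("#"):  # HELP/TYPE comment lines
            continue
        m = LINE_RE.match(line)
        if not m:
            continue
        name, labels, value = m.groups()
        gpu = GPU_LABEL_RE.search(labels)
        values[(gpu.group(1) if gpu else None, name)] = float(value)
    return values


# made-up sample values, 1.0 == 100%
sample = """\
# HELP DCGM_FI_PROF_SM_OCCUPANCY The ratio of number of warps resident on an SM.
# TYPE DCGM_FI_PROF_SM_OCCUPANCY gauge
DCGM_FI_PROF_SM_OCCUPANCY{gpu="0"} 0.25
DCGM_FI_PROF_PIPE_TENSOR_ACTIVE{gpu="0"} 0.5
"""

for (gpu, name), value in sorted(parse_metrics(sample).items()):
    print(f"gpu {gpu} {name}: {value:.0%}")
```

The text itself can be fetched with `urllib.request.urlopen("http://localhost:9400/metrics").read().decode()` while the daemon is running.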

This is a quick way of doing that, but the intention is to use it with [Prometheus](https://prometheus.io/) which will give you nice charts. E.g. the article included an example where you can see the SM occupancy, Tensor core, FP16 and FP32 Core utilization in the second row of the charts:

![dcgm-metrics](images/dcgm-metrics.png)

([source](https://arthurchiao.art/blog/understanding-gpu-performance/))

For completeness, here is an example from the same article showing a 100% gpu util with a CUDA kernel that is doing absolutely nothing compute-wise other than occupying a single Streaming Multiprocessor (SM):

```
$ cat << EOT > 1_sm_kernel.cu
__global__ void simple_kernel() {
    while (true) {}
}

int main() {
    simple_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();
}
EOT
```

Let's compile it:
```
nvcc 1_sm_kernel.cu -o 1_sm_kernel
```
And now run it in console A:
```
$ ./1_sm_kernel
```
and in console B:
```
$ nvidia-smi
Tue Oct  8 09:49:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.90.12              Driver Version: 550.90.12      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          Off |   00000000:01:00.0 Off |                    0 |
| N/A   32C    P0             69W /  300W |     437MiB /  81920MiB |    100%      Default |
|                                         |                        |             Disabled |
```

You can see the `100%` GPU-Util. So here 1 SM is used whereas the A100-80GB PCIe has 108 SMs! And it's not even doing any compute, as it just runs an infinite loop that does nothing.


================================================
FILE: compute/cpu/README.md
================================================
# CPU

As of this writing machine learning workloads don't use much CPU, so there aren't too many things to tell in this chapter. As CPUs evolve to become more like GPUs this is likely to change, so I'm expecting this chapter to evolve along with the CPUs.

## How many cpu cores do you need

Per 1 accelerator you need:

1. 1 cpu core per process that is tied to the accelerator
2. 1 cpu core for each `DataLoader` worker process - and typically you need 2-4 workers.

2 workers is usually plenty for LMs, especially if the data is already preprocessed.

If you need to do dynamic transforms, which is often the case with computer vision models or VLMs, you may need 3-4 and sometimes more workers.

The goal is to be able to pull from the `DataLoader` instantly, and not block the accelerator's compute, which means that you need to pre-process a bunch of samples for the next iteration while the current iteration is running. In other words, preparing the next batch must take no longer than the accelerator needs to compute a single iteration on a batch of the same size.

Besides preprocessing, if you're pulling data dynamically from the cloud instead of local storage, you also need to make sure that the data is pre-fetched fast enough to feed the workers that feed the accelerator furnace.

Multiply that by the number of accelerators and add a few cores for the operating system (let's say 4).

If the node has 8 accelerators, and you use `num_workers` workers per accelerator, then you need `8*(num_workers+1)+4` cpu cores. If you're doing NLP, it'd usually be about 2 workers per accelerator, so `8*(2+1)+4` => 28 cpu cores. If you do CV training, and, say, you need 4 workers per accelerator, then it'd be `8*(4+1)+4` => 44 cpu cores.
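
The same arithmetic as a tiny helper, in case you want to plug in your own numbers. The function name and the default of 4 cores reserved for the operating system are just the assumptions used in the text above:

```python
def cpu_cores_needed(num_accelerators, num_workers, os_cores=4):
    """1 core per accelerator process + 1 per DataLoader worker, plus OS cores."""
    return num_accelerators * (num_workers + 1) + os_cores


print(cpu_cores_needed(8, 2))  # NLP example: 8*(2+1)+4 = 28
print(cpu_cores_needed(8, 4))  # CV example:  8*(4+1)+4 = 44
```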

What happens if you have more very active processes than the total number of cpu cores? Some processes will get preempted (put in the queue for when cpu cores become available) and you absolutely want to avoid any context switching.

But modern cloud offerings typically have 50-100+ cpu-cores so usually there is no problem to have enough cores to go around.

See also [Asynchronous DataLoader](../../training/performance#asynchronous-dataloader).



### CPU offload

Some frameworks, like [DeepSpeed](https://www.deepspeed.ai/tutorials/zero-offload/), can offload some compute work to the CPU without creating a bottleneck, in which case you'd want additional cpu cores.



## NUMA affinity

See [NUMA affinity](../../training/performance#numa-affinity).



## Hyperthreads

[Hyper-Threads](https://en.wikipedia.org/wiki/Hyper-threading) double the number of cpu cores by virtualizing each physical core into 2 virtual ones, allowing 2 threads to use the same cpu core at the same time. Depending on the type of workload this feature may or may not increase the overall performance. Intel, the inventor of this technology, suggests a possible 30% performance increase in some situations.

See also [To enable Hyper-Threads or not](../../orchestration/slurm/performance.md#to-enable-hyper-threads-or-not).


================================================
FILE: compute/cpu-memory/README.md
================================================
# CPU memory

This is a tiny chapter, since usually there are very few nuances one needs to know about CPU memory - which is a good thing!

Most of the ML workload compute happens on GPUs, but typically there should be at least as much CPU memory on each node as there is on the GPUs. So, for example, if you're on an H100 node with 8x 80GB GPUs, you have 640GB of GPU memory, and thus you want at least as much CPU memory. Most recent high-end cloud packages usually come with 1-2TBs of CPU memory.

## What CPU memory is needed for in ML workloads

- Loading the model weights, unless they are loaded directly onto the GPUs - this is usually a transitory memory usage that goes back to zero once the model has been moved to GPUs.
- Saving the model weights. In some situations each GPU writes its own checkpoint directly to the disk, in other cases the model is recomposed on the CPU before it's written to disk - this too is a transitory memory usage.
- Possible parameter and optimizer state offloading when using frameworks like [DeepSpeed](https://www.deepspeed.ai/tutorials/zero-offload/), in which case quite a lot of CPU memory might be needed.
- Activations calculated in the `forward` pass, which need to be available for the `backward` pass, can also be offloaded to CPU, rather than being discarded and then recomputed during the `backward` pass, which saves the recomputation overhead.
- `DataLoader` is usually one of the main users of CPU memory and at times it may consume very large amounts of memory. Typically there are at least 2x 8 DL workers running on each node, so you need enough memory to support at least 16 processes each holding some data. For example, in the case of streaming data from the cloud, if the data shards are large, these processes could easily eat up hundreds of GBs of CPU memory.
- The software itself and its dependent libraries use a bit of CPU memory, but this amount is usually negligible.

## Things to know

- If the `DataLoader` uses HF `datasets` in `mmap` mode, the Resident memory usage may appear to be huge, as it'll try to map the whole dataset into memory. This is misleading, however, since if the memory is needed elsewhere the OS will page out any unneeded mmap'ed pages back to the system. You can read more about it [here](https://stasosphere.com/entrepreneur-being/301-mmap-memory-leak-investigation/). This awareness, of course, applies to any dataset using `mmap`; I was using HF `datasets` as an example since it's very widely used.


================================================
FILE: contributors.md
================================================
# Contributors

Multiple contributors kindly helped to improve these ever improving and expanding notes.

1. Some of them did it via PRs, and are thus listed automatically [here](https://github.com/stas00/ml-engineering/graphs/contributors)
2. Others did it via various other ways so I'm listing them explicitly here:

- [Adam Moody](https://github.com/adammoody)
- [Alex Rogozhnikov](https://github.com/arogozhnikov)
- [Bowei Liu](https://github.com/boweiliu)
- [Darrick Horton](https://www.linkedin.com/in/darrick-horton/)
- [Elio VP](https://www.linkedin.com/in/eliovp/)
- [Garrett Goon](https://github.com/garrett361)
- [Horace He](https://github.com/Chillee)
- [Ivan Yashchuk](https://github.com/IvanYashchuk)
- [Jack Dent](https://github.com/jackdent)
- [Jon Stevens](https://github.com/jon-hotaisle)
- [Jordan Nanos](https://github.com/JordanNanos)
- [Mark Saroufim](https://github.com/msaroufim)
- [Olatunji Ruwase](https://github.com/tjruwase)
- Oren Leung
- [Quentin Anthony](https://github.com/Quentin-Anthony)
- [Ross Wightman](https://github.com/rwightman)
- [Samyam Rajbhandari](https://github.com/samyam)
- [Shikib Mehri](https://github.com/Shikib)
- [Siddharth Singh](https://github.com/siddharth9820)
- [Stéphane Requena](https://twitter.com/s_requena)
- [Zhiqi Tao](https://www.linkedin.com/in/zhiqitao/)


If you contributed to this text and for some reason you're not on one of these 2 lists - let's fix it by adding your name with a github or similar link here.


================================================
FILE: debug/NicerTrace.py
================================================
""" NicerTrace - an improved Trace package """

"""
To try it in action and to get a sense of how it can help you just run:
python debug/NicerTrace.py
"""


import datetime
import os
import socket
import sys
import sysconfig
import time
import trace


class NicerTrace(trace.Trace):
    # as the 2 paths overlap the longer with site-packages needs to be first
    py_dirs = [sysconfig.get_paths().get(k) for k in ["purelib", "stdlib"]]
    site_packages_dir = sysconfig.get_paths()["purelib"]
    stdlib_dir = sysconfig.get_paths()["stdlib"]

    def __init__(self, *args, packages_to_include=None, log_pids=False, **kwargs):
        """normal init plus added package/dir exclusion overrides:

        While preserving the original behavior a new optional arg is added `packages_to_include`
        with the following behavior:

        1. if ignoredirs is a list the original trace behavior is used - only those dirs and subdirs will be excluded
        2. if ignoredirs is None and packages_to_include is None - everything is included
        3. if packages_to_include="uninstalled" all packages found under  /.../site-packages will be excluded. I couldn't find a way to exclude core python packages under /.../lib/python3.8 since it'd then exclude site-packages as well
        4. if packages_to_include=["PIL", "numpy", "pytorch"] all packages found under /.../site-packages, and /.../lib/python3.8 will be excluded except the packages that were listed to be included - use top-level package name here
        5. if packages_to_include=None, everything under /.../site-packages, and /.../lib/python3.8 will be excluded and any packages that are installed via `pip install -e .` will be included

        """
        ignoredirs = kwargs.get("ignoredirs", None)

        if ignoredirs is not None and len(ignoredirs) > 1:
            if packages_to_include is not None:
                raise ValueError("can't have both ignoredirs and packages_to_include not None")
            kwargs["ignoredirs"] = ignoredirs
        elif packages_to_include is None:
            kwargs["ignoredirs"] = None
        elif packages_to_include == "uninstalled":
            # trace.Trace expects a list of dirs - a bare string would be iterated char by char
            kwargs["ignoredirs"] = [self.stdlib_dir]  # everything including python core packages
        else:
            # exclude the sub-paths of /.../site-packages, except the packages listed in packages_to_include
            packages = os.listdir(self.site_packages_dir)
            packages_to_exclude = set(packages) - set(packages_to_include)
            dirs_to_exclude = [
                f"{self.site_packages_dir}/{dir}" for dir in sorted(packages_to_exclude) if not dir.endswith("-info")
            ]
            # note, no way to exclude python core packages in this situation because
            # sysconfig.get_paths()'s purelib is a subset of stdlib :(, so excluding only site-packages
            kwargs["ignoredirs"] = dirs_to_exclude

        # not packages, but final module names like Image from Image.py
        # mods_to_exclude = []

        # print("\n".join(kwargs["ignoredirs"]))

        super().__init__(*args, **kwargs)
        self.log_pids = log_pids

    def strip_py_dirs(self, path):
        """strips python path prefix like /.../site-packages, and /.../lib/python3.8 if any matches"""
        for prefix in self.py_dirs:
            if path.startswith(prefix):
                return path.replace(prefix + "/", "")
        return path

    def globaltrace_lt(self, frame, why, arg):
        """Handler for call events.
        If the code block being entered is to be ignored, returns `None`,
        else returns self.localtrace.

        This is an override to properly show full package names:
        1. if it's under site-packages or core python dir - convert to package name
        2. otherwise show full path to the python file - usually uninstalled packages

        Additionally enter frames now include the line number since some packages have multiple
        methods that have the same name and there is no telling which one of them was called.

        It was written against https://github.com/python/cpython/blob/3.8/Lib/trace.py. If you're
        using a different python version you may have to adapt it should the core implementation
        change (but it's unlikely)

        """
        if why == "call":
            code = frame.f_code
            # print(f"\n\n{frame.f_code=}")
            # print(dir(code))

            filename = frame.f_globals.get("__file__", None)
            if filename:
                lineno = code.co_firstlineno
                # python's trace fails to get the full package name - let's fix it
                # strip the common path of python library
                modulename = self.strip_py_dirs(filename)
                if filename != modulename:
                    # the package was installed under /.../site-packages, /.../lib/python3.8
                    modulename, ext = os.path.splitext(modulename)
                    modulename = modulename.replace("/", ".")
                else:
                    # still full path, because the package is not installed
                    modulename = filename

                if modulename is not None:
                    # XXX: ignoremods may not work now as before
                    ignore_it = self.ignore.names(filename, modulename)
                    if not ignore_it:
                        if self.trace:
                            if self.log_pids:
                                print(os.getpid(), end=" ")

                            print(f"        {modulename}:{lineno} {code.co_name}")
                        return self.localtrace
            else:
                return None

    def localtrace_trace_and_count(self, frame, why, arg):
        """
        Overriding the default method.

        Using hh:mm:ss format for timestamps (instead of secs) as it's more readable when the trace is run for hours

        XXX: ideally it would be nice not to repeat the same module name on every line, but when I tried
        that I discovered that globaltrace_lt doesn't necessarily frame all the local calls, since
        localtrace_trace_and_count may continue printing local calls from an earlier frame w/o
        notifying that the context has changed. So we are forced to reprint the module name on each
        line to keep at least the incomplete context.

        Ideally there should be an indication of a frame change before all the local prints

        Read the disclaimer in globaltrace_lt that this was tested with py-3.8

        """
        if why == "line":
            # record the file name and line number of every trace
            filename = frame.f_code.co_filename
            lineno = frame.f_lineno
            key = filename, lineno
            self.counts[key] = self.counts.get(key, 0) + 1
            basename = os.path.basename(filename)
            if self.log_pids:
                print(os.getpid(), end=" ")
            if self.start_time:
                delta_time = trace._time() - self.start_time
                delta_time = str(datetime.timedelta(seconds=delta_time)).split(".")[0]
                print(delta_time, end=" ")
            print(f"{basename}:{lineno:>6}: {trace.linecache.getline(filename, lineno)}", end="")
        return self.localtrace

# -------------------------------- #


class Tee:
    """
    A helper class to tee print's output into a file.
    Usage:
    sys.stdout = Tee(filename)
    """

    def __init__(self, filename):
        self.stdout = sys.stdout
        self.file = open(filename, "a")

    def __getattr__(self, attr):
        return getattr(self.stdout, attr)

    def write(self, msg):
        # comment out the next line if you don't want to write to stdout
        self.stdout.write(msg)
        self.file.write(msg)
        self.file.flush()

    def flush(self):
        # comment out the next line if you don't want to write to stdout
        self.stdout.flush()
        self.file.flush()


# -------------------------------- #

import time

from PIL import Image

def main():
    img = Image.new("RGB", (4, 4))
    time.sleep(1)
    img1 = img.convert("RGB")

# or if you want to try another version of main:

# from transformers import AutoConfig
# def main():
#     c = AutoConfig.from_pretrained("t5-small")

if __name__ == "__main__":
    # enable the trace
    if 1:
        cwd = os.path.realpath(".")
        pid = os.getpid()
        hostname = socket.gethostname()
        local_rank = int(os.environ.get("LOCAL_RANK", 0))
        trace_output_file = f"{cwd}/trace-{hostname}-{local_rank}-{pid}.txt"

        # run the new command using the given tracer
        sys.stdout = Tee(trace_output_file)

        # create a Trace object, telling it what to ignore, and whether to
        # do tracing or line-counting or both.
        # tracer = trace.Trace(
        tracer = NicerTrace(
            # ignoredirs=dirs_to_exclude, # don't set this one if you use packages_to_include
            # ignoremods=mods_to_exclude,
            trace=1,
            count=1,
            timing=True,
            # log_pids=True, useful if you fork workers and want to tell which process the trace belongs to
            packages_to_include=["PIL"],
        )

        # string with commands to run - passed to exec()
        tracer.run("main()")
        # or to use the function interface to call main with args, kwargs
        # tracer.runfunc(main, *args, **kwargs)
    else:
        main()


================================================
FILE: debug/README.md
================================================
# Debugging and Troubleshooting


## Guides

- [Debugging PyTorch programs](./pytorch.md)

- [Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs](./torch-distributed-hanging-solutions.md)

- [Network Debug](../network/debug/)

- [Troubleshooting NVIDIA GPUs](../compute/accelerator/nvidia/debug.md)

- [Underflow and Overflow Detection](./underflow_overflow.md)



## Tools

- [Debug Tools](./tools.md)

- [torch-distributed-gpu-test.py](./torch-distributed-gpu-test.py) - this is a `torch.distributed` diagnostics
  script that checks that all GPUs in the cluster (one or many nodes) can talk to each other and allocate GPU memory.

- [NicerTrace](./NicerTrace.py) - this is an improved `trace` python module with multiple additional flags added to the constructor and more useful output.
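
For orientation, here is a minimal sketch of the standard library `trace.Trace` class that NicerTrace subclasses. It uses only the stock constructor arguments (`ignoredirs`, `trace`, `count`), not the extra NicerTrace flags, and `fib` is a hypothetical stand-in for whatever code you want to trace.

```python
import sys
import trace


def fib(n):
    """Hypothetical workload to trace; not part of this repo."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a


# count=1 records per-line execution counts, trace=0 suppresses per-line printing;
# ignoredirs skips code that lives under the interpreter's install prefixes.
tracer = trace.Trace(ignoredirs=[sys.prefix, sys.exec_prefix], trace=0, count=1)

result = tracer.runfunc(fib, 10)  # runs fib(10) under the tracer and returns its result
counts = tracer.results().counts  # dict keyed by (filename, lineno)
print(f"fib(10) = {result}, traced {len(counts)} distinct lines")
```

Setting `trace=1` instead prints every executed line, which is the output that NicerTrace reformats.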


================================================
FILE: debug/make-tiny-models-tokenizers-datasets.md
================================================
# Faster debug and development with tiny models, tokenizers and datasets

If you debug problems and develop with full-sized models and tokenizers, you're likely not working very efficiently. Not only is it much more difficult to solve the problem, but the time spent waiting for the program to restart and reach the desired point can be huge - cumulatively this can be a big drain on one's motivation and productivity, not to mention that the resolution takes much longer, if it arrives at all.

The solution is simple:

**Unless you're testing the quality of a model, always use a tiny random model with a potentially tiny tokenizer.**

Moreover, large models often require massive resources, which are typically expensive and can also make the debugging process super complicated. For example, any debugger can handle a single process, but if your model doesn't fit and requires some sort of [parallelization](../training/model-parallelism) that needs multiple processes, most debuggers will either break or have issues giving you what you need. The ideal development environment is a single process, and a tiny model is guaranteed to fit on even the cheapest and smallest consumer GPU available. You could even use the free [Google Colab](https://colab.research.google.com/) to do development in a pinch if you have no GPUs around.

So the updated ML development mantra then becomes:

- the larger the model the better the final product generates
- the smaller the model the quicker the final product's training can be started

footnote: recent research shows that larger isn't always better, but the mantra is good enough to convey my point.

Once your code is working, do switch to the real model to test the quality of your generation. But even then, first try the smallest model that produces a quality result. Only when you can see that the generation is mostly right should you use the largest model to validate that your work is correct.

## Making a tiny model

Important: given their popularity and the well designed simple API I will be discussing HF [`transformers`](https://github.com/huggingface/transformers/) models. But the same principle can be applied to any other model.

TLDR: it's trivial to make a tiny HF `transformers` model:

1. Fetch the config object of a full size model
2. Shrink the hidden size and perhaps a few other parameters that contribute to the bulk of the model
3. Create a model from that shrunken config
4. Save this model. Done!

footnote: It's critical to remember that this will generate a random model, so don't expect any quality from its output.

footnote: These notes were written with HF Transformers models in mind. If you're using a different modeling library you may have to adapt some of these things.

Now let's go through the actual code and convert ["google/mt5-small"](https://huggingface.co/google/mt5-small/tree/main) into its tiny random counterpart.

```
from transformers import MT5Config, MT5ForConditionalGeneration

mname_from = "google/mt5-small"
mname_very_small = "mt5-tiny-random"

config = MT5Config.from_pretrained(mname_from)

config.update(dict(
    d_model=64,
    d_ff=256,
))
print("new config", config)

very_small_model = MT5ForConditionalGeneration(config)
print(f"num of params {very_small_model.num_parameters()}")

very_small_model.save_pretrained(mname_very_small)
```

As you can see it's trivial to do. And you can make it even smaller if you don't need the hidden size to be at least 64. For example try 8 - you just need to make sure that the number of attention heads isn't larger than the hidden size.
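
To get a feel for how much each dimension buys you, here is a rough back-of-the-envelope estimate. It is only a sketch for a generic transformer stack, not MT5's exact layout: `estimate_params` is a hypothetical helper, the formula ignores biases, layer norms and any extra projections, it assumes the attention inner dimension equals the hidden size, and the `vocab_size` of 250,000 is a round number picked for illustration.

```python
def estimate_params(hidden_size, d_ff, num_layers, vocab_size):
    """Rough weight count for a generic transformer stack (an approximation only)."""
    attention = 4 * hidden_size * hidden_size  # Q, K, V and output projections
    feed_forward = 2 * hidden_size * d_ff      # up and down projections
    embeddings = vocab_size * hidden_size      # token embedding table
    return embeddings + num_layers * (attention + feed_forward)


# compare a few hidden sizes while keeping d_ff at 4x the hidden size
for hidden_size in (512, 64, 8):
    n = estimate_params(hidden_size, d_ff=4 * hidden_size, num_layers=2, vocab_size=250_000)
    print(f"hidden_size={hidden_size:>4}: ~{n:,} params")
```

Under these assumptions the embedding table dominates once the hidden size is tiny, which is why pairing the tiny model with a tiny tokenizer shrinks it much further.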

Also please note that you don't need any GPUs to do that, and you could do this even for a huge 176B-parameter model like [BLOOM-176B](https://huggingface.co/bigscience/bloom), since you never load the actual original model, only its config object.

Before modifying the config you can dump the original parameters and choose to shrink more dimensions. For example, using fewer layers makes it even smaller and easier to debug. So here is what you can do instead:

```
config.update(dict(
    d_mode
SYMBOL INDEX (257 symbols across 32 files)

FILE: build/mdbook/md-to-html.py
  function convert_markdown_to_html (line 21) | def convert_markdown_to_html(markdown_path, args):
  function make_cover_page_file (line 38) | def make_cover_page_file(cover_md_file, date):
  function write_html_index (line 51) | def write_html_index(html_chapters_file, markdown_files):

FILE: build/mdbook/mv-links.py
  function rewrite_links (line 18) | def rewrite_links(markdown_path, src, dst):

FILE: build/mdbook/preprocess-html-for-epub.py
  function file_to_prefix (line 16) | def file_to_prefix(filepath):
  function prefix_anchors_in_file (line 26) | def prefix_anchors_in_file(html_file, prefix):
  function update_links_in_file (line 53) | def update_links_in_file(html_file, file_prefix_map, file_anchor_maps):
  function main (line 130) | def main():

FILE: build/mdbook/utils/build_utils.py
  function get_markdown_files (line 3) | def get_markdown_files(md_chapters_file):

FILE: build/mdbook/utils/github_md_utils.py
  function md_is_relative_link (line 37) | def md_is_relative_link(link):
  function md_process_local_links (line 44) | def md_process_local_links(para, callback, **kwargs):
  function md_link_break_up (line 54) | def md_link_break_up(text):
  function md_link_build (line 78) | def md_link_build(link_text, link, anchor=None):
  function resolve_rel_link (line 89) | def resolve_rel_link(link, cwd_rel_path):
  function md_expand_links (line 98) | def md_expand_links(text, cwd_rel_path, repo_url=""):
  function md_rename_relative_links (line 141) | def md_rename_relative_links(text, cwd_rel_path, src, dst):
  function md_convert_md_target_to_html (line 200) | def md_convert_md_target_to_html(text):
  function md_header_to_anchor (line 211) | def md_header_to_anchor(text):
  function md_header_to_md_link (line 228) | def md_header_to_md_link(text, link=''):

FILE: compute/accelerator/benchmarks/mamf-finder.py
  function get_torch_dtype (line 54) | def get_torch_dtype(dtype_str):
  class Arch (line 65) | class Arch:
    method __init__ (line 66) | def __init__(self):
    method __repr__ (line 69) | def __repr__(self):
  class CUDAArch (line 72) | class CUDAArch(Arch):
    method __init__ (line 74) | def __init__(self):
    method device (line 81) | def device(self):
    method name (line 85) | def name(self):
    method device_info (line 89) | def device_info(self):
    method compute_info (line 93) | def compute_info(self):
    method event (line 99) | def event(self, enable_timing=True):
    method synchronize (line 102) | def synchronize(self):
  class HPUArch (line 105) | class HPUArch(Arch):
    method __init__ (line 107) | def __init__(self):
    method device (line 111) | def device(self):
    method name (line 115) | def name(self):
    method device_info (line 119) | def device_info(self):
    method compute_info (line 123) | def compute_info(self):
    method event (line 126) | def event(self, enable_timing=True):
    method synchronize (line 129) | def synchronize(self):
  class XPUArch (line 132) | class XPUArch(Arch):
    method __init__ (line 134) | def __init__(self):
    method device (line 138) | def device(self):
    method name (line 142) | def name(self):
    method device_info (line 146) | def device_info(self):
    method compute_info (line 150) | def compute_info(self):
    method event (line 153) | def event(self, enable_timing=True):
    method synchronize (line 156) | def synchronize(self):
  class MPSEvent (line 159) | class MPSEvent:
    method __init__ (line 161) | def __init__(self):
    method record (line 164) | def record(self):
    method elapsed_time (line 168) | def elapsed_time(self, other):
  class MPSArch (line 173) | class MPSArch(Arch):
    method __init__ (line 175) | def __init__(self):
    method device (line 179) | def device(self):
    method name (line 183) | def name(self):
    method device_info (line 187) | def device_info(self):
    method compute_info (line 191) | def compute_info(self):
    method event (line 203) | def event(self, enable_timing=True):
    method synchronize (line 206) | def synchronize(self):
  function get_accelerator_arch (line 209) | def get_accelerator_arch():
  class Tee (line 235) | class Tee(object):
    method __init__ (line 236) | def __init__(self, filename, verbose):
    method write (line 243) | def write(self, message):
    method flush (line 251) | def flush(self):
  function print_benchmark_header (line 257) | def print_benchmark_header(dtype, device, notes="None"):
  function benchmark_mm (line 290) | def benchmark_mm(m, n, k, dtype, device, num_iterations, num_warmup_iter...
  function setup_checks (line 373) | def setup_checks():
  function sigkill_handler (line 437) | def sigkill_handler(signum, frame):
  function finish (line 449) | def finish():

FILE: debug/NicerTrace.py
  class NicerTrace (line 18) | class NicerTrace(trace.Trace):
    method __init__ (line 24) | def __init__(self, *args, packages_to_include=None, log_pids=False, **...
    method strip_py_dirs (line 66) | def strip_py_dirs(self, path):
    method globaltrace_lt (line 73) | def globaltrace_lt(self, frame, why, arg):
    method localtrace_trace_and_count (line 122) | def localtrace_trace_and_count(self, frame, why, arg):
  class Tee (line 158) | class Tee:
    method __init__ (line 165) | def __init__(self, filename):
    method __getattr__ (line 169) | def __getattr__(self, attr):
    method write (line 172) | def write(self, msg):
    method flush (line 178) | def flush(self):
  function main (line 190) | def main():

FILE: debug/tiny-scripts/c4-en-10k.py
  class C4En10k (line 43) | class C4En10k(datasets.GeneratorBasedBuilder):
    method _info (line 54) | def _info(self):
    method _split_generators (line 62) | def _split_generators(self, dl_manager):
    method _generate_examples (line 69) | def _generate_examples(self, jsonl_file):

FILE: debug/tiny-scripts/cm4-synthetic-testing.py
  class CM4Synthetic (line 79) | class CM4Synthetic(datasets.GeneratorBasedBuilder):
    method _info (line 89) | def _info(self):
    method _split_generators (line 118) | def _split_generators(self, dl_manager):
    method _generate_examples (line 132) | def _generate_examples(self, data_path):

FILE: debug/tiny-scripts/general-pmd-ds-unpack.py
  function list2range (line 48) | def list2range(s):
  function unpack (line 57) | def unpack(args, idx, row):
  function dump_example_shapes (line 82) | def dump_example_shapes(idx, row):

FILE: debug/tiny-scripts/general-pmd-synthetic-testing.py
  class GeneralPMDSynthetic (line 79) | class GeneralPMDSynthetic(datasets.GeneratorBasedBuilder):
    method _info (line 89) | def _info(self):
    method _split_generators (line 110) | def _split_generators(self, dl_manager):
    method _generate_examples (line 124) | def _generate_examples(self, data_path):

FILE: debug/tiny-scripts/m4-ds-unpack.py
  function list2range (line 50) | def list2range(s):
  function unpack (line 59) | def unpack(args, idx, row):
  function dump_example_shapes (line 78) | def dump_example_shapes(idx, row):

FILE: debug/tiny-scripts/openwebtext-10k.py
  class Openwebtext10k (line 44) | class Openwebtext10k(datasets.GeneratorBasedBuilder):
    method _info (line 55) | def _info(self):
    method _split_generators (line 63) | def _split_generators(self, dl_manager):
    method _generate_examples (line 85) | def _generate_examples(self, txt_files):

FILE: debug/tiny-scripts/oscar-en-10k.py
  class OscarEn10k (line 49) | class OscarEn10k(datasets.GeneratorBasedBuilder):
    method _info (line 60) | def _info(self):
    method _split_generators (line 68) | def _split_generators(self, dl_manager):
    method _generate_examples (line 75) | def _generate_examples(self, jsonl_file):

FILE: debug/torch-distributed-gpu-test.py
  function print (line 65) | def print(*args, **kwargs):

FILE: debug/underflow_overflow.py
  class ExplicitEnum (line 19) | class ExplicitEnum(str, Enum):
    method _missing_ (line 25) | def _missing_(cls, value):
  class DebugUnderflowOverflow (line 30) | class DebugUnderflowOverflow:
    method __init__ (line 148) | def __init__(self, model, max_frames_to_save=21, trace_batch_nums=[], ...
    method save_frame (line 165) | def save_frame(self, frame=None):
    method expand_frame (line 171) | def expand_frame(self, line):
    method trace_frames (line 174) | def trace_frames(self):
    method reset_saved_frames (line 178) | def reset_saved_frames(self):
    method dump_saved_frames (line 181) | def dump_saved_frames(self):
    method analyse_model (line 189) | def analyse_model(self):
    method analyse_variable (line 197) | def analyse_variable(self, var, ctx):
    method batch_start_frame (line 207) | def batch_start_frame(self):
    method batch_end_frame (line 211) | def batch_end_frame(self):
    method create_frame (line 214) | def create_frame(self, module, input, output):
    method register_forward_hook (line 242) | def register_forward_hook(self):
    method _register_forward_hook (line 245) | def _register_forward_hook(self, module):
    method forward_hook (line 248) | def forward_hook(self, module, input, output):
  function get_abs_min_max (line 296) | def get_abs_min_max(var, ctx):
  function detect_overflow (line 301) | def detect_overflow(var, ctx):
  class DebugOption (line 347) | class DebugOption(ExplicitEnum):

FILE: network/benchmarks/all_gather_object_vs_all_gather.py
  function all_gather_object (line 33) | def all_gather_object():
  function all_gather (line 39) | def all_gather():

FILE: network/benchmarks/all_gather_object_vs_all_reduce.py
  function all_gather_object (line 28) | def all_gather_object():
  function all_reduce (line 34) | def all_reduce():

FILE: network/benchmarks/all_reduce_bench.py
  function get_device_info (line 134) | def get_device_info():
  function plot_averages (line 142) | def plot_averages(path, x, y, ranks):
  function plot_profile (line 174) | def plot_profile(path, y, ranks):
  function timed_allreduce (line 207) | def timed_allreduce(tensor, size, start_event, end_event):
  function run (line 225) | def run(local_rank):
  function device_id_kwargs (line 332) | def device_id_kwargs(local_rank):
  function parse_args (line 348) | def parse_args():

FILE: network/benchmarks/all_reduce_latency_comp.py
  function timed_allreduce (line 20) | def timed_allreduce(mat, repeat_times, id, start_event, end_event):
  function run (line 52) | def run(local_rank):
  function init_processes (line 74) | def init_processes(local_rank, fn, backend='nccl'):

FILE: testing/testing_utils.py
  function is_torch_available (line 50) | def is_torch_available():
  function parse_flag_from_env (line 54) | def parse_flag_from_env(key, default=False):
  function parse_int_from_env (line 70) | def parse_int_from_env(key, default=None):
  function require_torch (line 83) | def require_torch(test_case):
  function require_torch_no_gpus (line 96) | def require_torch_no_gpus(test_case):
  function require_torch_multi_gpu (line 110) | def require_torch_multi_gpu(test_case):
  function require_torch_non_multi_gpu (line 128) | def require_torch_non_multi_gpu(test_case):
  function require_torch_up_to_2_gpus (line 143) | def require_torch_up_to_2_gpus(test_case):
  function require_torch_gpu (line 165) | def require_torch_gpu(test_case):
  function is_deepspeed_available (line 173) | def is_deepspeed_available():
  function require_deepspeed (line 177) | def require_deepspeed(test_case):
  function is_bnb_available (line 187) | def is_bnb_available():
  function require_bnb (line 191) | def require_bnb(test_case):
  function require_bnb_non_decorator (line 203) | def require_bnb_non_decorator():
  function set_seed (line 211) | def set_seed(seed: int = 42):
  function get_gpu_count (line 226) | def get_gpu_count():
  function torch_assert_equal (line 238) | def torch_assert_equal(actual, expected, **kwargs):
  function torch_assert_close (line 248) | def torch_assert_close(actual, expected, **kwargs):
  function is_torch_bf16_available (line 263) | def is_torch_bf16_available():
  function require_torch_bf16 (line 281) | def require_torch_bf16(test_case):
  function get_tests_dir (line 289) | def get_tests_dir(append_path=None):
  function parameterized_custom_name_func_join_params (line 308) | def parameterized_custom_name_func_join_params(func, param_num, param):
  function apply_print_resets (line 344) | def apply_print_resets(buf):
  class CaptureStd (line 347) | class CaptureStd:
    method __init__ (line 401) | def __init__(self, out=True, err=True, replay=True):
    method __enter__ (line 420) | def __enter__(self):
    method __exit__ (line 433) | def __exit__(self, *exc):
    method __repr__ (line 450) | def __repr__(self):
  class CaptureStdout (line 465) | class CaptureStdout(CaptureStd):
    method __init__ (line 468) | def __init__(self, replay=True):
  class CaptureStderr (line 472) | class CaptureStderr(CaptureStd):
    method __init__ (line 475) | def __init__(self, replay=True):
  class CaptureLogger (line 479) | class CaptureLogger:
    method __init__ (line 503) | def __init__(self, logger):
    method __enter__ (line 509) | def __enter__(self):
    method __exit__ (line 513) | def __exit__(self, *exc):
    method __repr__ (line 517) | def __repr__(self):
  function ExtendSysPath (line 523) | def ExtendSysPath(path: Union[str, os.PathLike]) -> Iterator[None]:
  class TestCasePlus (line 542) | class TestCasePlus(unittest.TestCase):
    method setUp (line 623) | def setUp(self):
    method test_file_path (line 644) | def test_file_path(self):
    method test_file_path_str (line 648) | def test_file_path_str(self):
    method test_file_dir (line 652) | def test_file_dir(self):
    method test_file_dir_str (line 656) | def test_file_dir_str(self):
    method tests_dir (line 660) | def tests_dir(self):
    method tests_dir_str (line 664) | def tests_dir_str(self):
    method data_dir (line 668) | def data_dir(self):
    method data_dir_str (line 672) | def data_dir_str(self):
    method repo_root_dir (line 676) | def repo_root_dir(self):
    method repo_root_dir_str (line 680) | def repo_root_dir_str(self):
    method src_dir (line 684) | def src_dir(self):
    method src_dir_str (line 688) | def src_dir_str(self):
    method get_env (line 691) | def get_env(self):
    method get_auto_remove_tmp_dir (line 708) | def get_auto_remove_tmp_dir(self, tmp_dir=None, before=None, after=None):
    method get_auto_remove_tmp_dir_str (line 777) | def get_auto_remove_tmp_dir_str(self, *args, **kwargs):
    method tearDown (line 780) | def tearDown(self):
  function mockenv (line 787) | def mockenv(**kwargs):
  function mockenv_context (line 804) | def mockenv_context(*remove, **update):
  function get_xdist_worker_id (line 842) | def get_xdist_worker_id():
  function get_unique_port_number (line 853) | def get_unique_port_number():
  function write_file (line 865) | def write_file(file, content):
  function read_json_file (line 870) | def read_json_file(file):
  function replace_str_in_file (line 875) | def replace_str_in_file(file, text_to_search, replacement_text):
  function pytest_addoption_shared (line 935) | def pytest_addoption_shared(parser):
  function pytest_terminal_summary_main (line 954) | def pytest_terminal_summary_main(tr, id):
  class _RunOutput (line 1090) | class _RunOutput:
    method __init__ (line 1091) | def __init__(self, returncode, stdout, stderr):
  function _read_stream (line 1097) | async def _read_stream(stream, callback):
  function _stream_subprocess (line 1106) | async def _stream_subprocess(cmd, env=None, stdin=None, timeout=None, qu...
  function execute_subprocess_async (line 1147) | def execute_subprocess_async(cmd, env=None, stdin=None, timeout=180, qui...

FILE: training/checkpoints/torch-checkpoint-shrink.py
  function get_pt_files (line 27) | def get_pt_files(checkpoint_dir, patterns):
  function shrink_dict_values (line 43) | def shrink_dict_values(d, prefix=""):
  function shrink_pt_file (line 54) | def shrink_pt_file(f):
  function checkpoint_shrink (line 66) | def checkpoint_shrink(checkpoint_dir, patterns):

FILE: training/fault-tolerance/fs-watchdog.py
  function send_email (line 24) | def send_email(subject, body):
  function send_email_alert (line 38) | def send_email_alert(msg):
  function check_running_on_jean_zay (line 54) | def check_running_on_jean_zay():
  function run_cmd (line 61) | def run_cmd(cmd, check=True):
  function get_args (line 76) | def get_args():
  function main (line 82) | def main():

FILE: training/fault-tolerance/slurm-status.py
  function send_email (line 30) | def send_email(subject, body):
  function send_email_alert_job_not_scheduled (line 44) | def send_email_alert_job_not_scheduled(job_name):
  function check_running_on_jean_zay (line 63) | def check_running_on_jean_zay():
  function run_cmd (line 70) | def run_cmd(cmd):
  function get_slurm_group_status (line 85) | def get_slurm_group_status():
  function get_remaining_time (line 101) | def get_remaining_time(time_str):
  function get_preamble (line 112) | def get_preamble():
  function process_job (line 118) | def process_job(jobid, partition, name, state, time, nodes, start_time, ...
  function get_args (line 143) | def get_args():
  function main (line 151) | def main():

FILE: training/performance/benchmarks/dataloader/num-workers-bench.py
  class MyDataset (line 16) | class MyDataset(torch.utils.data.Dataset):
    method __init__ (line 18) | def __init__(self):
    method __len__ (line 21) | def __len__(self):
    method __getitem__ (line 24) | def __getitem__(self, idx):

FILE: training/performance/benchmarks/dataloader/pin-memory-non-block-bench.py
  class MyDataset (line 25) | class MyDataset(torch.utils.data.Dataset):
    method __init__ (line 27) | def __init__(self):
    method __len__ (line 30) | def __len__(self):
    method __getitem__ (line 33) | def __getitem__(self, idx):

FILE: training/performance/benchmarks/matrix-shape/swiglu-maf-bench.py
  function benchmark_bmm (line 48) | def benchmark_bmm(b, m, n, k, num_iterations=100, num_matmuls=1):

FILE: training/performance/benchmarks/numa/numa-set-pynvml.py
  function set_numa_affinity (line 12) | def set_numa_affinity(gpu_index, verbose=False):

FILE: training/performance/distributed/torch-dist-mem-usage.py
  function see_memory_usage (line 24) | def see_memory_usage(message, force=False, ranks=[0]):
  function init_processes (line 72) | def init_processes(local_rank, backend='nccl'):

FILE: training/tools/main_process_first.py
  function get_local_rank (line 14) | def get_local_rank() -> int:
  function get_global_rank (line 17) | def get_global_rank() -> int:
  function is_local_fs (line 27) | def is_local_fs(path):
  function path_to_fs_type (line 36) | def path_to_fs_type(path):
  function is_main_process_by_path (line 52) | def is_main_process_by_path(path):
  function is_local_main_process (line 58) | def is_local_main_process():
  function is_global_main_process (line 61) | def is_global_main_process():
  function _goes_first (line 65) | def _goes_first(is_main: bool):
  function main_process_by_path_first (line 76) | def main_process_by_path_first(path):
  function global_main_process_first (line 103) | def global_main_process_first():
  function local_main_process_first (line 125) | def local_main_process_first():
  function ds_load_emulate (line 157) | def ds_load_emulate():

FILE: training/tools/multi-gpu-non-interleaved-print.py
  function printflock (line 15) | def printflock(*args, **kwargs):

FILE: training/tools/printflock.py
  function printflock (line 22) | def printflock(*args, **kwargs):
Condensed preview: 114 files, each showing path, character count, and a content snippet (full structured content: 1,094K chars).
[
  {
    "path": ".gitignore",
    "chars": 1831,
    "preview": "# HTML build\n*.html\nchapters-html.txt\n\n# Byte-compiled / optimized / DLL files\n__pycache__/\n*.py[cod]\n*$py.class\n\n# C ex"
  },
  {
    "path": "LICENSE-CC-BY-SA",
    "chars": 20127,
    "preview": "Attribution-ShareAlike 4.0 International\n\n=======================================================================\n\nCreat"
  },
  {
    "path": "Makefile",
    "chars": 2614,
    "preview": "# usage: make help\n\n.PHONY: help spell prep-html-files html html-local pdf epub upload check-links-local check-links-all"
  },
  {
    "path": "README.md",
    "chars": 7882,
    "preview": "# Machine Learning Engineering Open Book\n\nThis is an open collection of methodologies, tools and step by step instructio"
  },
  {
    "path": "build/README.md",
    "chars": 1461,
    "preview": "# Book Building\n\nImportant: this is still a WIP - it mostly works, but stylesheets need some work to make the pdf really"
  },
  {
    "path": "build/linkcheckerrc",
    "chars": 211,
    "preview": "# rtfm https://linkchecker.github.io/linkchecker/man/linkcheckerrc.html\n\n[output]\n\n[text]\ncolorwarning=blue\n\n[AnchorChec"
  },
  {
    "path": "build/mdbook/md-to-html.py",
    "chars": 2910,
    "preview": "import argparse\nimport datetime\nimport re\n\nfrom functools import partial\nfrom markdown_it import MarkdownIt\nfrom mdit_py"
  },
  {
    "path": "build/mdbook/mv-links.py",
    "chars": 934,
    "preview": "\"\"\"\n\nwhen chapters are moved around this script rewrites local relative links\n\npython build/mdbook/mv-links.py slurm orc"
  },
  {
    "path": "build/mdbook/preprocess-html-for-epub.py",
    "chars": 5736,
    "preview": "#!/usr/bin/env python3\n\"\"\"\nPreprocess HTML files for EPUB generation.\nMakes all anchor IDs globally unique by prefixing "
  },
  {
    "path": "build/mdbook/utils/build_utils.py",
    "chars": 152,
    "preview": "from pathlib import Path\n\ndef get_markdown_files(md_chapters_file):\n    return [Path(l) for l in md_chapters_file.read_t"
  },
  {
    "path": "build/mdbook/utils/github_md_utils.py",
    "chars": 7415,
    "preview": "\"\"\"\nThe utils in this module replicate github logic, which means it may or may not work for other markdown\n\"\"\"\n\nimport r"
  },
  {
    "path": "build/prince_style.css",
    "chars": 3293,
    "preview": "/*\n  CSS style sheet for prince html2pdf system (http://www.princexml.com/)\n\n  Here's an example of how to use the style"
  },
  {
    "path": "build/requirements.txt",
    "chars": 60,
    "preview": "codespell\nlinkchecker\nmarkdown-it-py\nmdit-py-plugins\npandoc\n"
  },
  {
    "path": "chapters-md.txt",
    "chars": 1496,
    "preview": "README.md\n\ninsights/ai-battlefield.md\ninsights/how-to-choose-cloud-provider.md\n\ncompute/README.md\ncompute/accelerator/RE"
  },
  {
    "path": "compute/README.md",
    "chars": 257,
    "preview": "# Compute\n\n1. **[Accelerator](accelerator)** - the work horses of ML - GPUs, TPUs, IPUs, FPGAs, HPUs, QPUs, RDUs (WIP)\n\n"
  },
  {
    "path": "compute/accelerator/README.md",
    "chars": 60428,
    "preview": "# Accelerators\n\nCompute accelerators are the workhorses of the ML training. At the beginning there were just GPUs. But n"
  },
  {
    "path": "compute/accelerator/amd/debug.md",
    "chars": 8683,
    "preview": "# Troubleshooting AMD GPUs\n\nXXX: this is very early - collecting various tools/notes\n\nAs most of us are well familiar wi"
  },
  {
    "path": "compute/accelerator/amd/performance.md",
    "chars": 560,
    "preview": "# AMD GPUs Performance\n\nAs I haven't had a chance to do any serious work with AMD GPUs, just sharing links for now.\n\n- ["
  },
  {
    "path": "compute/accelerator/benchmarks/README.md",
    "chars": 8141,
    "preview": "# Accelerator Benchmarks\n\n## Maximum Achievable Matmul FLOPS Finder\n\nMaximum Achievable Matmul FLOPS (MAMF) Benchmark: ["
  },
  {
    "path": "compute/accelerator/benchmarks/mamf-finder.py",
    "chars": 17690,
    "preview": "#!/usr/bin/env python\n\n\"\"\"\n\nThis is Maximum Achievable Matmul FLOPS (MAMF) Finder\n\nFor a quick run use:\n\npython mamf-fin"
  },
  {
    "path": "compute/accelerator/nvidia/debug.md",
    "chars": 23151,
    "preview": "# Troubleshooting NVIDIA GPUs\n\n## Glossary\n\n- DBE: Double Bit ECC Error\n- DCGM: (NVIDIA) Data Center GPU Manager\n- ECC: "
  },
  {
    "path": "compute/cpu/README.md",
    "chars": 2998,
    "preview": "# CPU\n\nAs of this writing Machine learning workloads don't use much CPU so there aren't too many things to tell in this "
  },
  {
    "path": "compute/cpu-memory/README.md",
    "chars": 2521,
    "preview": "# CPU memory\n\nThis is a tiny chapter, since usually there are very few nuances one needs to know about CPU memory - whic"
  },
  {
    "path": "contributors.md",
    "chars": 1483,
    "preview": "# Contributors\n\nMultiple contributors kindly helped to improve these ever improving and expanding notes.\n\n1. Some of the"
  },
  {
    "path": "debug/NicerTrace.py",
    "chars": 9475,
    "preview": "\"\"\" NicerTrace - an improved Trace package \"\"\"\n\n\"\"\"\nTo try it in action and to get a sense of how it can help you just r"
  },
  {
    "path": "debug/README.md",
    "chars": 810,
    "preview": "# Debugging and Troubleshooting\n\n\n## Guides\n\n- [Debugging PyTorch programs](./pytorch.md)\n\n- [Diagnosing Hangings and De"
  },
  {
    "path": "debug/make-tiny-models-tokenizers-datasets.md",
    "chars": 20631,
    "preview": "# Faster debug and development with tiny models, tokenizers and datasets\n\nIf you're debugging problems and develop with "
  },
  {
    "path": "debug/nccl-performance-debug.md",
    "chars": 75,
    "preview": "# NCCL: Debug and Performance\n\nmoved to [Network Debug](../network/debug).\n"
  },
  {
    "path": "debug/pytorch.md",
    "chars": 36900,
    "preview": "# Debugging PyTorch programs\n\n\n## Getting nodes to talk to each other\n\nOnce you need to use more than one node to scale "
  },
  {
    "path": "debug/tiny-scripts/README.md",
    "chars": 647,
    "preview": "# A Back up of scripts\n\nThis is a backup of scripts discussed in [Faster debug and development with tiny models, tokeniz"
  },
  {
    "path": "debug/tiny-scripts/c4-en-10k.py",
    "chars": 2622,
    "preview": "# coding=utf-8\n# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.\n#\n# License"
  },
  {
    "path": "debug/tiny-scripts/cm4-synthetic-testing.py",
    "chars": 8259,
    "preview": "\n\"\"\"\n\nThis dataset was generated by:\n\n# select a few seed records so there is some longer and shorter text, records with"
  },
  {
    "path": "debug/tiny-scripts/fsmt-make-super-tiny-model.py",
    "chars": 3263,
    "preview": "#!/usr/bin/env python\n# coding: utf-8\n# Copyright 2020 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the"
  },
  {
    "path": "debug/tiny-scripts/general-pmd-ds-unpack.py",
    "chars": 3142,
    "preview": "# unpack the desired datasets records into a filesystem-based subdir structure which can then be\n# used to create a synt"
  },
  {
    "path": "debug/tiny-scripts/general-pmd-synthetic-testing.py",
    "chars": 6836,
    "preview": "\n\"\"\"\n\nThis dataset was generated by:\n\n# prep dataset repo\nhttps://huggingface.co/new-dataset => HuggingFaceM4/general-pm"
  },
  {
    "path": "debug/tiny-scripts/idefics-make-tiny-model.py",
    "chars": 1545,
    "preview": "#!/usr/bin/env python\n\n# This script creates a smallish random model, with a few layers to test things quickly\n#\n# It al"
  },
  {
    "path": "debug/tiny-scripts/m4-ds-unpack.py",
    "chars": 3553,
    "preview": "# unpack the desired datasets records into a filesystem-based subdir structure which can then be\n# used to create a synt"
  },
  {
    "path": "debug/tiny-scripts/mt5-make-tiny-model.py",
    "chars": 3347,
    "preview": "#!/usr/bin/env python\n# coding: utf-8\n# Copyright 2021 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the"
  },
  {
    "path": "debug/tiny-scripts/openwebtext-10k.py",
    "chars": 3078,
    "preview": "# coding=utf-8\n# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.\n#\n# License"
  },
  {
    "path": "debug/tiny-scripts/oscar-en-10k.py",
    "chars": 4192,
    "preview": "# coding=utf-8\n# Copyright 2020 The HuggingFace Datasets Authors and the current dataset script contributor.\n#\n# License"
  },
  {
    "path": "debug/tools.md",
    "chars": 9790,
    "preview": "# Debug Tools\n\n## git-related tools\n\n\n### Useful aliases\n\nShow a diff of all files modified in the current branch agains"
  },
  {
    "path": "debug/torch-distributed-gpu-test.py",
    "chars": 3856,
    "preview": "#!/usr/bin/env python\n\n\"\"\"\n\nThis a `torch.distributed` diagnostics script that checks that all GPUs in the cluster (one "
  },
  {
    "path": "debug/torch-distributed-hanging-solutions.md",
    "chars": 22713,
    "preview": "# Diagnosing Hangings and Deadlocks in Multi-Node Multi-GPU Python Programs\n\nWhile the methodologies found in this artic"
  },
  {
    "path": "debug/underflow_overflow.md",
    "chars": 11089,
    "preview": "# Underflow and Overflow Detection\n\nFor this section we are going to use the [underflow_overflow](./underflow_overflow.p"
  },
  {
    "path": "debug/underflow_overflow.py",
    "chars": 13104,
    "preview": "# Copyright 2020 The HuggingFace Team. All rights reserved.\n#\n# Licensed under the Apache License, Version 2.0 (the \"Lic"
  },
  {
    "path": "inference/README.md",
    "chars": 50790,
    "preview": "# Inference\n\nXXX: this chapter is under construction - some sections are complete, some are still starting out, many are"
  },
  {
    "path": "insights/ai-battlefield.md",
    "chars": 29669,
    "preview": "# The AI Battlefield Engineering - What You Need To Know\n\nThis chapter is one person's opinionated overview of the ML/AI"
  },
  {
    "path": "insights/how-to-choose-cloud-provider.md",
    "chars": 28373,
    "preview": "# How to Choose a Cloud Provider\n\nHaving used multiple compute clouds over long and short terms, and participating in ma"
  },
  {
    "path": "model-parallelism/README.md",
    "chars": 63,
    "preview": "## Moved\n\n**Moved to [here](../training/model-parallelism/).**\n"
  },
  {
    "path": "network/README.md",
    "chars": 66777,
    "preview": "# Inter-node and Intra-Node Networking Hardware\n\n**Subsections**:\n\n- [Communication Patterns](comms.md)\n- [Network Debug"
  },
  {
    "path": "network/benchmarks/README.md",
    "chars": 18869,
    "preview": "# Networking Benchmarks\n\n## Tools\n\n### all_reduce benchmark\n\n[all_reduce_bench.py](all_reduce_bench.py) - a tool to benc"
  },
  {
    "path": "network/benchmarks/all_gather_object_vs_all_gather.py",
    "chars": 1755,
    "preview": "#!/usr/bin/env python\n\n#\n# all_gather to gather counts across process group is 23x faster than the same via all_gather_o"
  },
  {
    "path": "network/benchmarks/all_gather_object_vs_all_reduce.py",
    "chars": 1325,
    "preview": "#!/usr/bin/env python\n\n#\n# all_reduce to gather counts across process group is 23x faster than the same via all_gather_o"
  },
  {
    "path": "network/benchmarks/all_reduce_bench.py",
    "chars": 14222,
    "preview": "#!/usr/bin/env python\n\n\"\"\"\n\nThe latest version of this program can be found at https://github.com/stas00/ml-engineering\n"
  },
  {
    "path": "network/benchmarks/all_reduce_bench_pyxis.sbatch",
    "chars": 728,
    "preview": "#!/bin/bash\n#SBATCH --job-name=all_reduce_bench_pyxis\n#SBATCH --nodes=2\n#SBATCH --ntasks-per-node=1\n#SBATCH --gres=gpu:8"
  },
  {
    "path": "network/benchmarks/all_reduce_latency_comp.py",
    "chars": 2779,
    "preview": "#!/usr/bin/env python\n\n# this is derived from the all_reduce_bench.py\n# but adjusted to show how 1x 4GB reduction is muc"
  },
  {
    "path": "network/benchmarks/results/README.md",
    "chars": 70,
    "preview": "# Network Benchmarks Results\n\n- [Disabling NVLink](disable-nvlink.md)\n"
  },
  {
    "path": "network/benchmarks/results/disable-nvlink.md",
    "chars": 1660,
    "preview": "# Disabling NVLink Benchmark\n\nLet's compare the training of a gpt2 language model training over a small sample of wikite"
  },
  {
    "path": "network/comms.md",
    "chars": 8635,
    "preview": "# Communication Patterns\n\nThe intention of this chapter is not to show code examples and explain APIs for which there ar"
  },
  {
    "path": "network/debug/README.md",
    "chars": 17544,
    "preview": "# Network Debug\n\nOften you don't need to be a network engineer to figure out networking issues. Some of the common issue"
  },
  {
    "path": "orchestration/README.md",
    "chars": 2443,
    "preview": "# Orchestration\n\nThere are many container/accelerator orchestration solutions - many of which are open source.\n\nSo far I"
  },
  {
    "path": "orchestration/kubernetes/README.md",
    "chars": 2468,
    "preview": "# Kubernetes\n\nIn general IMHO kubernetes is the wrong environment for doing training work, with [SLURM](../slurm/) being"
  },
  {
    "path": "orchestration/slurm/README.md",
    "chars": 1277,
    "preview": "# Working in SLURM Environment\n\nUnless you're lucky and you have a dedicated cluster that is completely under your contr"
  },
  {
    "path": "orchestration/slurm/admin.md",
    "chars": 4112,
    "preview": "# SLURM Administration\n\n\n## Run a command on multiple nodes\n\n1. to avoid being prompted with:\n```\nAre you sure you want "
  },
  {
    "path": "orchestration/slurm/cron-daily.slurm",
    "chars": 690,
    "preview": "#!/bin/bash\n#SBATCH --job-name=cron-daily        # job name\n#SBATCH --ntasks=1                   # number of MP tasks\n#S"
  },
  {
    "path": "orchestration/slurm/cron-hourly.slurm",
    "chars": 692,
    "preview": "#!/bin/bash\n#SBATCH --job-name=cron-hourly       # job name\n#SBATCH --ntasks=1                   # number of MP tasks\n#S"
  },
  {
    "path": "orchestration/slurm/example.slurm",
    "chars": 2184,
    "preview": "#!/bin/bash\n\n# this is a 2 node slurm job example, you will most likely need to adapt --cpus-per-task and --partition\n\n#"
  },
  {
    "path": "orchestration/slurm/launchers/README.md",
    "chars": 1077,
    "preview": "# Single and Multi-node Launchers with SLURM\n\nThe following are complete SLURM scripts that demonstrate how to integrate"
  },
  {
    "path": "orchestration/slurm/launchers/accelerate-launcher.slurm",
    "chars": 3049,
    "preview": "#!/bin/bash\n\n# this is a 2 node SLURM script using `accelerate` launcher\n# Important: you will need to adapt setting whe"
  },
  {
    "path": "orchestration/slurm/launchers/lightning-launcher.slurm",
    "chars": 2146,
    "preview": "#!/bin/bash\n\n# this is a 2 node SLURM script for launching Lightning-based programs\n# Important: you will need to adapt "
  },
  {
    "path": "orchestration/slurm/launchers/srun-launcher.slurm",
    "chars": 2644,
    "preview": "#!/bin/bash\n\n# this is a 2 node SLURM script for launching srun-based programs\n# Important: you will need to adapt setti"
  },
  {
    "path": "orchestration/slurm/launchers/torchrun-launcher.slurm",
    "chars": 2769,
    "preview": "#!/bin/bash\n\n# this is a 2 node SLURM script using `torchrun` launcher\n# Important: you will need to adapt setting where"
  },
  {
    "path": "orchestration/slurm/performance.md",
    "chars": 3597,
    "preview": "# SLURM Performance\n\nHere you will find discussions of SLURM-specific settings that impact performance.\n\n## srun's `--cp"
  },
  {
    "path": "orchestration/slurm/undrain-good-nodes.sh",
    "chars": 1880,
    "preview": "#!/bin/bash\n\n# When nodes get auto placed in drain because SLURM fails to wait till all last job's processes are killed "
  },
  {
    "path": "orchestration/slurm/users.md",
    "chars": 44793,
    "preview": "# SLURM for users\n\n## Quick start\n\nSimply copy this [example.slurm](./example.slurm) and adapt it to your needs.\n\n## SLU"
  },
  {
    "path": "resources/README.md",
    "chars": 5309,
    "preview": "# Resources\n\n## Similar online guides\n\n- Boris Dayma wrote [A Recipe for Training Large Models](https://wandb.ai/craiyon"
  },
  {
    "path": "stabs/README.md",
    "chars": 111,
    "preview": "# Stabs\n\nSome very early notes on various topics, not meant for reading or fixing. Please ignore this sub-dir.\n"
  },
  {
    "path": "stabs/incoming.md",
    "chars": 11024,
    "preview": "# Things to add / integrate\n\n# pdf book notes\n\nideas from Sam: https://github.com/saforem2: https://github.com/stas00/ml"
  },
  {
    "path": "storage/README.md",
    "chars": 49903,
    "preview": "# Storage: File Systems and IO\n\n## Machine Learning IO needs\n\nFor training workloads there are 3 distinct IO needs:\n\n1. "
  },
  {
    "path": "storage/benchmarks/results/hope-2023-12-20-14-37-02-331702-summary.md",
    "chars": 971,
    "preview": "# fio benchmark results for hope on 2023-12-20-14:37:02\n\npartition /mnt/nvme0/fio/fio-test\n\n\n*  filesize=16k read\n\n| lat"
  },
  {
    "path": "storage/fio-json-extract.py",
    "chars": 1054,
    "preview": "#!/bin/env python\n\n#\n# usage:\n#\n# ./fio-json-extract.py fio-json-file.json\n#\n# The script expects an fio-generated json "
  },
  {
    "path": "storage/fio-scan",
    "chars": 2822,
    "preview": "#!/bin/bash\n\n# This script will run fio on a given partition/path for read/write for 16KB, 1MB and 1GB file sizes\n# usin"
  },
  {
    "path": "testing/README.md",
    "chars": 31772,
    "preview": "# Writing and Running Tests\n\nNote: a part of this document refers to functionality provided by the included [testing_uti"
  },
  {
    "path": "testing/testing_utils.py",
    "chars": 38139,
    "preview": "# I developed the bulk of this library while I worked at HF\n\n# Copyright 2020 The HuggingFace Team. All rights reserved."
  },
  {
    "path": "todo.md",
    "chars": 214,
    "preview": "# TODO\n\nAlso see [stabs](./stabs)\n\n- re-run all-reduce bench and update plot+table as the bench switched to KiB/MiB/etc."
  },
  {
    "path": "training/README.md",
    "chars": 1101,
    "preview": "# Training\n\n**Subsections**:\n\n- [Model parallelism](model-parallelism)\n\n- [Performance](performance)\n\n- [Fault Tolerance"
  },
  {
    "path": "training/checkpoints/README.md",
    "chars": 578,
    "preview": "# Checkpoints\n\n- [torch-checkpoint-convert-to-bf16](./torch-checkpoint-convert-to-bf16) - converts an existing fp32 torc"
  },
  {
    "path": "training/checkpoints/torch-checkpoint-convert-to-bf16",
    "chars": 1348,
    "preview": "#!/bin/bash\n\n# this script converts torch's *bin and safetensor *safetensor files to bf16 creating a new checkpoint unde"
  },
  {
    "path": "training/checkpoints/torch-checkpoint-shrink.py",
    "chars": 3306,
    "preview": "#!/usr/bin/env python\n\n# This script fixes checkpoints which for some reason stored tensors with storage larger than the"
  },
  {
    "path": "training/datasets.md",
    "chars": 2535,
    "preview": "# Dealing with datasets\n\n## Preprocessing and caching datasets on the main process\n\nHF Accelerate has a very neat contai"
  },
  {
    "path": "training/dtype.md",
    "chars": 6649,
    "preview": "# Tensor precision / Data types\n\nThese are the common datatypes that are used as of this writing in ML (usually referred"
  },
  {
    "path": "training/emulate-multi-node.md",
    "chars": 10498,
    "preview": "# Emulate a multi-node setup using just a single node\n\nThe goal is to emulate a 2-node environment using a single node w"
  },
  {
    "path": "training/fault-tolerance/README.md",
    "chars": 39230,
    "preview": "# Fault Tolerance\n\nRegardless of whether you own the ML training hardware or renting it by the hour, in this ever speedi"
  },
  {
    "path": "training/fault-tolerance/fs-watchdog.py",
    "chars": 7404,
    "preview": "#!/usr/bin/env python\n\n#\n# This tool alerts on the status of the filesystem - when it's getting close to running out of "
  },
  {
    "path": "training/fault-tolerance/fs-watchdog.slurm",
    "chars": 614,
    "preview": "#!/bin/bash\n#SBATCH --job-name=fs-watchdog       # job name\n#SBATCH --ntasks=1                   # number of MP tasks\n#S"
  },
  {
    "path": "training/fault-tolerance/slurm-status.py",
    "chars": 6213,
    "preview": "#!/usr/bin/env python\n\n#\n# This tool reports on the status of the job - whether it's running or scheduled and various ot"
  },
  {
    "path": "training/fault-tolerance/slurm-status.slurm",
    "chars": 975,
    "preview": "#!/bin/bash\n#SBATCH --job-name=tr11-176B-ml      # job name\n#SBATCH --ntasks=1                   # number of MP tasks\n#S"
  },
  {
    "path": "training/hparams.md",
    "chars": 2321,
    "preview": "# Selecting Training Hyper-Parameters And Model Initializations\n\nThe easiest way to find a good hparam and model init st"
  },
  {
    "path": "training/instabilities/README.md",
    "chars": 4777,
    "preview": "# Avoiding, Recovering From and Understanding Instabilities\n\nSub-sections:\n\n* [Understanding Training Loss Patterns](tra"
  },
  {
    "path": "training/instabilities/training-loss-patterns.md",
    "chars": 10303,
    "preview": "# Understanding Training Loss Patterns\n\nTraining loss plot is similar to the heart beat pattern - there is the good, the"
  },
  {
    "path": "training/model-parallelism/README.md",
    "chars": 54139,
    "preview": "# Model Parallelism\n\n\n## Parallelism overview\n\nIn the modern machine learning the various approaches to parallelism are "
  },
  {
    "path": "training/performance/README.md",
    "chars": 62985,
    "preview": "# Software Tune Up For The Best Performance\n\nThe faster you can make your model to train the sooner the model will finis"
  },
  {
    "path": "training/performance/benchmarks/activation-memory-per-layer.py",
    "chars": 1765,
    "preview": "#!/usr/bin/env python\n\n# This script derives the coefficient num_of_hidden_states_copies in `num_of_hidden_states_copies"
  },
  {
    "path": "training/performance/benchmarks/dataloader/num-workers-bench.py",
    "chars": 1635,
    "preview": "#!/usr/bin/env python\n\n\"\"\"\n\nThis benchmark shows that num_workers>0 leads to a better performance\n\nusage:\n\n./num-workers"
  },
  {
    "path": "training/performance/benchmarks/dataloader/pin-memory-non-block-bench.py",
    "chars": 2155,
    "preview": "#!/usr/bin/env python\n\n\"\"\"\n\nThis benchmark shows that a combo of:\n\n(1) DataLoader(pin_memory=True, ...)\n(2) batch.to(dev"
  },
  {
    "path": "training/performance/benchmarks/matrix-shape/swiglu-maf-bench.py",
    "chars": 3464,
    "preview": "#!/usr/bin/env python\n\n\"\"\"\n\nThis script will help you find the intermediate value of the hidden layer of the MLP when Sw"
  },
  {
    "path": "training/performance/benchmarks/numa/numa-set-pynvml.py",
    "chars": 1967,
    "preview": "# this helper util will assign the cpu-cores belonging to the same NUMA node as the GPU\n\n# derived from\n# https://github"
  },
  {
    "path": "training/performance/benchmarks/numa/numa-set.sh",
    "chars": 1011,
    "preview": "#!/usr/bin/bash\n\n# this helper util performs NUMA node binding which can be used with torchrun, and other launchers\n# co"
  },
  {
    "path": "training/performance/distributed/torch-dist-mem-usage.py",
    "chars": 3386,
    "preview": "#!/usr/bin/env python\n\n\"\"\"\n\nThis script demonstrates that when using `torch.distributed` a few GBs of GPU memory is take"
  },
  {
    "path": "training/re-train-hub-models.md",
    "chars": 2529,
    "preview": "# Re-train HF Hub Models From Scratch Using Finetuning Examples\n\nHF Transformers has awesome finetuning examples  https:"
  },
  {
    "path": "training/reproducibility/README.md",
    "chars": 6194,
    "preview": "# Reproducibility\n\n## Achieve determinism in randomness based software\n\nWhen debugging always set a fixed seed for all t"
  },
  {
    "path": "training/tools/main_process_first.py",
    "chars": 6386,
    "preview": "\"\"\"\nTooling for dealing with efficient dataset loading in a multi-process, potentially multi-node environment with share"
  },
  {
    "path": "training/tools/multi-gpu-non-interleaved-print.py",
    "chars": 1225,
    "preview": "#!/usr/bin/env python\n\n# printflock allows one to print in a non-interleaved fashion when printing from multiple procese"
  },
  {
    "path": "training/tools/printflock.py",
    "chars": 1903,
    "preview": "# If you have ever done multi-gpu work and tried to `print` for debugging you quickly discovered\n# that some messages ge"
  }
]

About this extraction

This page contains the full source code of the stas00/ml-engineering GitHub repository, extracted and formatted as plain text. The extraction includes 114 files (1.0 MB), approximately 269.3k tokens, and a symbol index with 257 extracted functions, classes, methods, constants, and types.

Extracted by GitExtract. Built by Nikandr Surkov.
